Abstract: For a stationary additive Gaussian-noise channel
with a rational noise power spectrum of a finite-order L, 
we derive two new results for the feedback capacity under 
an average channel input power constraint. First, we show 
that a very simple feedback-dependent Gauss-Markov source 
achieves the feedback capacity, and that Kalman-Bucy filtering 
is optimal for processing the feedback. Based on these results, 
we develop a new method for optimizing the channel inputs for 
achieving the Cover-Pombra block-length-n feedback capacity 
by using a dynamic programming approach that decomposes the 
computation into n sequentially identical optimization problems 
where each stage involves optimizing $O(L^2)$ variables. Second,
we derive the explicit maximal information rate for stationary 
feedback-dependent sources. In general, evaluating the maximal 
information rate for stationary sources requires solving only a 
few equations by simple non-linear programming. For first-order 
autoregressive and/or moving average (ARMA) noise channels, 
this optimization admits a closed form maximal information rate 
formula. The maximal information rate for stationary sources is a 
lower bound on the feedback capacity, and it equals the feedback 
capacity if the long-standing conjecture, that stationary sources 
achieve the feedback capacity, holds. 

Index Terms — channel capacity, directed information, dynamic 
programming, feedback capacity, Gauss-Markov source, infor- 
mation rate, intersymbol interference, Kalman-Bucy filter, linear 
Gaussian noise channel, noise whitening filter 



I. Introduction 

We consider discrete-time power-constrained linear Gaus- 
sian noise channels, where the signal is corrupted by an 
additive Gaussian random process. When the channel is mem- 
oryless, i.e., when the channel is corrupted by additive white 
Gaussian noise (AWGN), Shannon [1] provided a simple 
formula for computing the feed-forward channel capacity, and 
he also proved that feedback does not increase the channel 
capacity [2]. 

For Gaussian noise channels with memory, i.e., when the 
channel noise is correlated, the feed-forward channel capacity 
can be determined by the power-spectral-density water-filling 
method [3], [4], [5], [6]. However, with noiseless feedback, 
i.e., when the transmitter knows without error all previous 
channel outputs, it has been a long-standing open problem 
to explicitly characterize the optimal (or feedback-capacity-achieving)
signal and thus compute the feedback channel
capacity. Cover and Pombra [7] defined the n-block feedback 
capacity and formulated the n-block feedback capacity com- 
putation as an optimization problem. Upper and lower bounds 
on the feedback capacity were established. For example, it 
is known that the feedback capacity can never exceed the 
capacity without feedback (feed-forward capacity) by more 
than 0.5 bit/channel-use [7], or that the feedback capacity can 
never be more than double the feed-forward capacity [8], [9]. 
Somewhat tighter upper bounds can be computed by numerical 
techniques as in [10]. Butman [11] devised a feedback code 
that recursively transmits the message through the channel, 
which achieves a higher rate than the feed-forward channel 
capacity for certain linear Gaussian channels. For first-order 
autoregressive (AR) noise channels, the closed-form infor- 
mation rate obtained by Butman [11] is very close to the 
tightest upper bound and thus has been conjectured to be 
the real feedback capacity, but a rigorous proof has been 
missing. Ordentlich [12] characterized an optimal feedback 
coding scheme for moving-average Gaussian noise channels. 
Tatikonda [13] formulated the feedback channel capacity in 
terms of the directed information rate. Ihara [14] studied the 
continuous-time Gaussian noise channel with feedback. 
There are two main contributions in this work: 

1) We characterize the optimal signaling and feedback 
strategy for achieving the n-block feedback capacity of a 
power constrained linear Gaussian channel with a ratio- 
nal power spectrum. We show that the optimal source is 
a simple feedback-dependent Gauss-Markov source and 
that a Kalman-Bucy filter is optimal for processing the 
feedback. This leads to a reformulation of the problem as 
a stochastic control optimization problem. As a result, a 
new method based on dynamic programming is derived 
to optimize the source and thus compute the n-block 
feedback capacity. For computing the n-block feedback 
capacity, the new method decomposes the computation 
into n identical sequential optimization problems with 
each stage involving only $O(L^2)$ variables, where $L$ is
the order of the rational power spectral density of the 
noise (or channel). 

We prove that a Gauss-Markov source (channel input
process) $X_t$ of the following form (also depicted in
Fig. 1) is optimal

$$X_t = \underbrace{d_t^{T}\,(S_{t-1} - m_{t-1})}_{\text{Kalman innovation}} + e_t Z_t,$$
Fig. 2. A discrete-time linear Gaussian noise channel



Fig. 1. Optimal source for achieving the feedback capacity 



where $d_t$ (a vector of dimension $L$) and $e_t$ are predetermined
coefficients, the vectors $S_{t-1}$ and $m_{t-1}$ are the
channel state (of dimension $L$) and the channel state
estimate (computed by the Kalman-Bucy filter in the
feedback loop), respectively, and $Z_t$ is a zero-mean
unit-variance Gaussian random variable that is independent
of prior channel inputs, outputs and noise.
We note that the Kalman-Bucy filter developed in this
paper estimates the channel intersymbol-interference
state $S_t$. It takes a different form from, but is closely
related to, the filters used in Schalkwijk [15] or Butman [11],
which recursively estimate the transmitted message.
The optimal channel input derived in this paper includes
a recursive estimate term (the Kalman innovation) and
a novelty term ($e_t Z_t$). Thus, it could be equivalent to the
recursive message estimating and transmitting scheme
in [15], [11] only if the optimal value of the novelty
term $e_t Z_t$ is zero, which is an open problem that we
leave unresolved.
2) We derive an explicit formula for the maximal in- 
formation rate achieved by (asymptotically) stationary 
feedback-dependent sources. This represents a lower 
bound on the feedback capacity. We note that it is a long- 
standing conjecture that a stationary source achieves the 
feedback capacity, and this conjecture is not proved in 
this paper and is still open. If the optimal Kalman- 
Bucy filter for processing the feedback (optimized over 
both stationary and non-stationary feedback-dependent 
sources) has a steady state, i.e., if the optimal Kalman-Bucy
filter becomes stationary as $n \to \infty$, the feedback
channel capacity exists and equals the maximal infor- 
mation rate for stationary sources. 
For the case of the first order autoregressive (AR) noise 
channels, our optimal signaling scheme for achieving the 
maximal stationary-source information rate turns out to 
be the same as Butman's code [11]. 

Paper organization: We introduce the Gaussian noise
channel model in Section II. For convenience, the problem is
reformulated in the state-space (or state-machine) realization
context. In Section III, the n-block feedback capacity is
expressed in a form that is suitable for solving the optimization
problem using dynamic programming techniques. In
Section IV, we show that Gauss-Markov sources achieve the
n-block feedback capacity and that a Kalman-Bucy filter is
optimal for processing the feedback. Section V is devoted
to solving the feedback capacity computation problem. A
simple feedback-capacity-achieving signaling scheme is explicitly
characterized and a dynamic programming algorithm
to optimize the source is presented. In Section VI, we derive
an explicit formula for the maximal feedback information
rate achieved by stationary sources, which represents a lower
bound on the feedback capacity. This maximal feedback
information rate can be evaluated by nonlinear programming
techniques. We solve the nonlinear programming problem
in closed form for first-order autoregressive/moving-average
(ARMA) channels. Section VII concludes the paper.

II. Power-Constrained Linear Gaussian Noise 
Channel Model 

Let $t \in \mathbb{Z}$ denote the discrete time index, and let the random
variable $X_t$ denote the channel input at time $t$. As depicted in
Figure 2, additive stationary Gaussian noise $N_t$ corrupts the
channel input $X_t$ to form the channel output random variable

$$R_t = X_t + N_t. \tag{1}$$



It is assumed that the power spectral density function of the
noise process $N_t$ is known and is denoted as $S_N(\omega)$. In its
most general formulation, the power spectral density function
of the Gaussian noise process can be an arbitrary nonnegative
function defined on the interval $\omega \in (-\pi, \pi]$, such that it is
even, i.e., $S_N(\omega) = S_N(-\omega)$, and its power is finite

$$\sigma_N^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_N(\omega)\, d\omega < \infty. \tag{2}$$



Further, it is required that the average signal power be bounded
by a known value $P$ from above, i.e., we have an average
channel-input power constraint

$$\lim_{n \to \infty} E\left[\frac{1}{n} \sum_{t=1}^{n} X_t^2\right] \le P. \tag{3}$$

If a finite block length $n$ is considered as in [7], then the power
constraint becomes

$$E\left[\frac{1}{n} \sum_{t=1}^{n} X_t^2\right] \le P. \tag{4}$$



The Gaussian noise channel has memory when the noise is
correlated. A correlated noise process $N_t$ can be obtained by
passing a white Gaussian noise process $W_t$ through a linear
filter defined by a constant-coefficient difference equation of
the following form

$$N_t = W_t - \sum_{i=1}^{L} a_i W_{t-i} - \sum_{i=1}^{L} c_i N_{t-i}, \tag{5}$$



where $W_t$ is a white Gaussian random process with variance
$\sigma_W^2$. It is easy to verify that the z-domain transfer
function of the linear noise filter in (5) is

$$H(z) = \frac{1 - \sum_{i=1}^{L} a_i z^{-i}}{1 + \sum_{i=1}^{L} c_i z^{-i}} \tag{6}$$

Fig. 3. A linear Gaussian noise channel with a rational noise filter

Fig. 4. Linear Gaussian noise channel with feedback.



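and that the power spectral density function of the noise
process $N_t$ is rational, i.e.,

$$S_N(\omega) = \sigma_W^2 \left|H(e^{j\omega})\right|^2 = \sigma_W^2\, \frac{\left|1 - \sum_{i=1}^{L} a_i e^{-ji\omega}\right|^2}{\left|1 + \sum_{i=1}^{L} c_i e^{-ji\omega}\right|^2}. \tag{7}$$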



Such a linear Gaussian noise channel with memory is depicted
in Figure 3.

It is known that any power spectral density can be approximated
arbitrarily closely by the rational function (7) if
the model size (memory length) parameter $L$ is chosen to be
large enough [17]. So, restriction (7) is not severe, but as we
will see, it is essential in the subsequent analysis.
Further, without loss of generality, we may assume that the
function $H(z)$ has all its poles and zeros inside the unit circle
and that no poles or zeros occur at the origin $z = 0$. Thus,
$H(z)$ is a causal and stable minimum phase linear filter [17],
and its inverse is also a causal and stable minimum phase
linear filter.
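
As a minimal numerical sketch (not part of the original derivation), the noise model (5) and the rational spectrum (7) can be checked against each other. The function names `noise_psd` and `simulate_noise`, the zero initial conditions, and the first-order example coefficients are illustrative assumptions; the sign convention follows (5) and (6) as written above.

```python
import numpy as np

def noise_psd(a, c, sigma_w2, omega):
    """Evaluate the rational spectrum (7) at the frequencies in `omega`."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    i = np.arange(1, len(a) + 1)
    phases = np.exp(-1j * np.outer(omega, i))    # e^{-j i omega}, shape (len(omega), L)
    numerator = 1.0 - phases @ a                 # 1 - sum_i a_i e^{-j i omega}
    denominator = 1.0 + phases @ c               # 1 + sum_i c_i e^{-j i omega}
    return sigma_w2 * np.abs(numerator) ** 2 / np.abs(denominator) ** 2

def simulate_noise(a, c, sigma_w2, n, rng):
    """Run the difference equation (5) with zero initial conditions (an assumption)."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    L = len(a)
    w = rng.normal(0.0, np.sqrt(sigma_w2), size=n + L)
    noise = np.zeros(n + L)
    for t in range(L, n + L):
        past_w = w[t - L:t][::-1]        # W_{t-1}, ..., W_{t-L}
        past_n = noise[t - L:t][::-1]    # N_{t-1}, ..., N_{t-L}
        noise[t] = w[t] - a @ past_w - c @ past_n
    return noise[L:]

# Hypothetical first-order AR example: N_t = W_t + 0.5 N_{t-1}, i.e. a_1 = 0, c_1 = -0.5.
grid = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
power_from_psd = noise_psd([0.0], [-0.5], 1.0, grid).mean()   # approximates (1/2pi) * integral in (2)
samples = simulate_noise([0.0], [-0.5], 1.0, 50000, np.random.default_rng(0))
print(power_from_psd, samples.var())
```

The grid average of the spectrum approximates the noise power in (2), and the sample variance of the simulated process should agree with it up to estimation error.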

For the linear Gaussian noise channel, the channel capacity [1]
is defined as the maximal amount of information that
could be transmitted per channel use with an arbitrarily small
error probability. It is well known that noiseless channel output
feedback, i.e., error-free observation of the channel output by
the transmitter, may increase the channel capacity [11]. Thus,
if we denote the feed-forward channel capacity as $C$ and the
feedback channel capacity as $C^{fb}$, then $C \le C^{fb}$.

As illustrated in Figure 4, the transmitter knows at time $t$,
without error, all previous realizations of the channel outputs
before it forms and sends out the next channel input signal.
The transmission starts at time $t = 1$. We also assume
that prior to time $t = 1$, a known signal is transmitted,
e.g., $X_t = 0$ for $t \le 0$, and thus both the transmitter and
receiver know the prior channel outputs $R_{-\infty}^{0} = r_{-\infty}^{0} = n_{-\infty}^{0}$.
This is equivalent to assuming that both the receiver
and the transmitter know $N_{-\infty}^{0} = n_{-\infty}^{0}$
and $W_{-\infty}^{0} = w_{-\infty}^{0}$.
This assumption on the channel inputs and outputs prior to
transmission is mainly for simplifying the analysis in the
state-space setting, which we formally introduce in Section II-C.



We denote the message as $\mathcal{M}$ and encode it into $n$
feedback-dependent channel input signal variables $X_1^n$ by using a
feedback encoder, which in its most general form is

$$X_t = X_t\left(\mathcal{M}, R_1^{t-1}, n_{-\infty}^{0}\right). \tag{8}$$

The receiver tries to decode the message $\mathcal{M}$ based on the
realization of the channel output variables $R_1^n = r_1^n$.
We next summarize some of the relevant known results.

A. Butman's Recursive Feedback Coding Scheme

Butman [11] considered a very general communication
system for colored Gaussian noise channels with or without
feedback. The feedback considered can be noiseless or noisy.
If we omit the noisy feedback scenario, the transmitted signal
$X_t$ is chosen as a linear combination of the feedback and the
Gaussian message $\mathcal{M} = \theta$ as

$$X_t = \delta_t \theta + \sum_{i=1}^{t-1} \alpha_{ti} R_i. \tag{9}$$

With this transmission model, Butman optimized the signaling
for certain Gaussian noise channels with feedback. For
first-order autoregressive noise channels, the following feedback
scheme achieves the maximal information rate under the model
in (9). At each time instant $t > 0$, the transmitter computes
the receiver-side minimum-variance estimate of the message

$$\hat{\theta}_{t-1} = E\left[\theta \mid R_1^{t-1} = r_1^{t-1}\right] \tag{10}$$

and transmits

$$X_t = \delta_t \left(\theta - \hat{\theta}_{t-1}\right). \tag{11}$$

By optimizing the coefficients $\delta_t$ to maximize the information
rate subject to the power constraint, a tight lower bound on
the feedback capacity was obtained, which was shown to be
greater than the feed-forward channel capacity. This result
was generalized to higher order channels [18]. Proving that
Butman's coding scheme (11) achieves the capacity is still an
open problem.

B. n-Block Feedback Capacity 

Cover and Pombra [7] introduced the (block-length-$n$) feedback
capacity as the maximal achievable information rate
(i.e., maximal information per channel use) for finite block
length (or finite horizon) $n$, and showed that a sequence
of codes exists that can achieve the capacity as $n \to \infty$.
Since we are assuming that the noise realization $n_{-\infty}^{0}$ is
known both to the receiver and the transmitter prior to the
start of transmission, we can express the Cover-Pombra n-block
feedback capacity as

$$C^{fb(n)} = \max \frac{1}{n} I\left(\mathcal{M}; R_1^n \mid n_{-\infty}^{0}\right),$$

where the maximization is taken under a finite-horizon power
constraint

$$E\left[\frac{1}{n} \sum_{t=1}^{n} X_t^2 \,\Big|\, N_{-\infty}^{0} = n_{-\infty}^{0}\right] \le P. \tag{12}$$
Define $K_N^{(n)}$ and $K_{X+N}^{(n)}$ as the covariance matrices of the
noise vector $N_1^n$ and of the channel output vector
$R_1^n = X_1^n + N_1^n$, respectively.
As shown in [7], the n-block feedback capacity (or maximal
information rate) equals

$$C^{fb(n)} = \max \frac{1}{2n} \log \frac{\left|K_{X+N}^{(n)}\right|}{\left|K_N^{(n)}\right|}. \tag{13}$$

Here, the maximization is taken over all channel input variables
$X_1^n$, which take the following linear form

$$X_t = \sum_{i=1}^{t-1} b_{ti} N_i + V_t, \quad \text{for } t = 1, 2, \ldots, n, \tag{14}$$

where $b_{ti}$ are the coefficients that need to be optimized, and
the random variables $V_t$ have a Gaussian distribution whose
covariance matrix needs to be optimized.

The n-block feedback capacity computation problem in (13)
is formulated for an arbitrary noise covariance matrix $K_N^{(n)}$.
This optimization problem can be solved by methods given
in [19]. The number of unknown variables in (14) is $O(n^2)$.
For the feedback code in (14), the transmitter would need
to remember and utilize all previous channel output realizations
$R_1^{t-1} = r_1^{t-1}$ (or the noise realizations $N_1^{t-1} = n_1^{t-1}$)
so as to form the next input signal $X_t = x_t$ by using the
linear equation (14). Thus, the encoding complexity grows
with time $t$ in general.
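
To make the size of this optimization concrete, the following minimal sketch evaluates the objective in (13) and the input power in (12) for one given choice of the feedback coefficients. It only evaluates a candidate and does not perform the maximization. The function name `cover_pombra_rate`, the matrix names `B` (holding the coefficients $b_{ti}$ below the diagonal) and `K_V`, and the use of base-2 logarithms are illustrative assumptions; $V_1^n$ is taken to be independent of the noise, as in [7].

```python
import numpy as np

def cover_pombra_rate(B, K_V, K_N):
    """Return (rate in bits per channel use, average input power) for X = B N + V.

    B must be strictly lower triangular so that X_t uses only N_1, ..., N_{t-1}.
    """
    n = K_N.shape[0]
    if not np.allclose(B, np.tril(B, k=-1)):
        raise ValueError("B must be strictly lower triangular (causal feedback)")
    eye = np.eye(n)
    K_out = (B + eye) @ K_N @ (B + eye).T + K_V   # covariance of X + N
    K_X = B @ K_N @ B.T + K_V                     # covariance of X
    _, logdet_out = np.linalg.slogdet(K_out)
    _, logdet_noise = np.linalg.slogdet(K_N)
    rate = (logdet_out - logdet_noise) / (2.0 * n * np.log(2.0))
    power = np.trace(K_X) / n
    return rate, power

# Sanity check on white noise without feedback: the rate is 0.5 * log2(1 + P).
n = 4
rate, power = cover_pombra_rate(np.zeros((n, n)), 3.0 * np.eye(n), np.eye(n))
print(rate, power)
```

In this parameterization, `B` contributes $n(n-1)/2$ free entries and `K_V` contributes $n(n+1)/2$, which is the $O(n^2)$ growth noted above.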

In the following, we reformulate the linear Gaussian noise
channel in Figure 4 as a state-space channel model. The
state-space formalism permits a very simple
feedback-capacity-achieving signaling scheme. For the optimal signaling
scheme derived in this paper, the number of variables to be
optimized grows as $O(n)$, and the encoding complexity is
constant for all time instants $t \ge 1$.

C. An Equivalent State-Space Gaussian Noise Channel Model 

The filter $H(z)$ in (6) is modeled as a rational filter with
all poles and zeros inside the unit circle and no poles or
zeros at the origin. The inverse of such a filter exists and
is also causal and stable, so we may filter (without causing latency)
the channel output by a noise whitening filter $H^{-1}(z)$ to get
an equivalent state-space (intersymbol interference) channel
model with white noise (depicted in Figure 5)

$$Y(z) = H^{-1}(z) R(z) = H^{-1}(z) X(z) + W(z), \tag{15}$$

or equivalently

$$Y_t - \sum_{i=1}^{L} a_i Y_{t-i} = X_t + \sum_{i=1}^{L} c_i X_{t-i} + W_t - \sum_{i=1}^{L} a_i W_{t-i}, \tag{16}$$

where $W_t$ is an independent and identically distributed
(i.i.d.) Gaussian random process with variance $\sigma_W^2$.

Fig. 5. An equivalent linear Gaussian noise channel model

Fig. 6. An $L$th-order LTI Gaussian noise channel with noiseless feedback
(noisy forward channel, noiseless feedback)

The channel model (16), depicted in Figure 5, is a channel
model with intersymbol interference due to the filter $H^{-1}(z)$.
Note that the transmitter obtains $Y_t$ by filtering the original
channel output $R_t$ using the filter $H^{-1}(z)$. The filters $H(z)$
and $H^{-1}(z)$ are causal, minimum phase and invertible, and
thus do not cause any delay, see [17]. Thus, the two channel
models depicted in Figure 4 and Figure 5 could be converted
into each other's form by filtering their channel outputs using
rational causal delay-free filters $H(z)$ and $H^{-1}(z)$, respectively.
Hence, the two channel models are mathematically
equivalent and have the same feedback (or feed-forward)
channel capacities.

The rational filter $H^{-1}(z)$ can be realized by shift registers [17],
and the channel model depicted in Figure 5 can
thus be represented as a state-space (or state-machine) channel
model with noiseless feedback. As depicted in Figure 6, we
only need to consider a linear time-invariant (LTI) filter [17]
observed through an additive white Gaussian noise (AWGN)
channel, see equation (15). The LTI filter in the channel is
completely characterized by its order $L$ and the two vectors
of tap coefficients

$$a \triangleq [a_1, a_2, \ldots, a_L]^T, \tag{17}$$
$$c \triangleq [c_1, c_2, \ldots, c_L]^T, \tag{18}$$

which are also coefficients of the rational filter $H(z)$ in
equation (6).

Let $X_t$ be the channel input at time $t$ whose realization
is denoted by $x_t$. Let $Y_t$ be the noisy channel output at
time $t$ whose realization is denoted by $y_t$ (which is the filtered
channel output $R_t$ of the linear Gaussian noise channel in
Figure 4). Let the vector of values stored in the shift registers
of the LTI filter, i.e., $S_t = [S_t(1), S_t(2), \ldots, S_t(L)]^T$, be the
channel state vector, and denote the state realization by
$s_t = [s_t(1), s_t(2), \ldots, s_t(L)]^T$. Since it is assumed that the channel
inputs $X_{-\infty}^{0}$ are known prior to the start of transmission at
time $t = 1$, the state realization $S_0 = s_0$ is also known. Then,
the channel can be described by the following assumptions:

1) The forward channel satisfies a state-space model, i.e.,

$$S_t = A S_{t-1} + b X_t \tag{19}$$
$$Y_t = (a + c)^T S_{t-1} + X_t + W_t, \tag{20}$$

where $W_t$ is white Gaussian noise with variance $\sigma_W^2$.
The square matrix $A$ and the vector $b$ are time-invariant
as follows

$$A \triangleq \begin{bmatrix} a_1 & a_2 & \cdots & a_{L-1} & a_L \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \quad b \triangleq \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{21}$$

The LTI filter described by the coefficient vectors $a$ and $c$
is invertible and stable. The initial state $S_0 = s_0$ is
known to both the encoder and the decoder.

2) The feedback link is noiseless and causal, i.e., the
encoder, before sending out symbol $X_t$, knows without
error all previous channel outputs $Y_1^{t-1} = y_1^{t-1}$.

3) The average channel input power is constrained by

$$\frac{1}{n} \sum_{t=1}^{n} E\left[(X_t)^2 \mid S_0 = s_0\right] \le P, \tag{22}$$

where $n$ is the total number of input symbols $X_1^n$ that
are used to encode the message $\mathcal{M}$.

From assumptions 1)-3), we have the following:

I) For any given initial channel state $S_0 = s_0$, the
sequences $S_1^t$ and $X_1^t$ determine each other uniquely
because of the linear equation (19). If a state sequence
does not conform to the channel law (19), it is called
invalid. We only need to consider valid state sequences
throughout. From the definition of the state $S_t$ in (19),
we see that when the initial state $s_0$ is known, the channel
input sequence $X_t$ and the channel state sequence $S_t$
are in a one-to-one relationship. So, the mutual information
rate between the channel input sequence (source) and
the channel output sequence is the same as the mutual
information rate between the channel state sequence and
the channel output sequence. Thus, it is valid to refer to
the state sequence as the source sequence.

II) Given the channel state pair $S_{t-1}^{t} = (s_{t-1}, s_t)$, the
channel output $Y_t$ is statistically independent of previous
channel states $S_0^{t-2}$ and outputs $Y_1^{t-1}$, that is

$$P_{Y_t \mid S_0^{t}, Y_1^{t-1}}\left(y_t \mid s_0^{t}, y_1^{t-1}\right) = P_{Y_t \mid S_{t-1}^{t}}\left(y_t \mid s_{t-1}^{t}\right). \tag{23}$$

Since the variance of the white Gaussian noise $W_t$
is $\sigma_W^2$, from equation (20) we have the following
conditional differential entropy of the channel output

$$h\left(Y_t \mid S_0^{t}, Y_1^{t-1}\right) = h(W_t) = \frac{1}{2} \log\left(2 \pi e \sigma_W^2\right). \tag{24}$$

III) Since the sequences $X_t$ and $S_t$ uniquely determine each
other for any given value of $s_0$, we can characterize the
source distribution either in terms of channel inputs $X_t$
as

$$P_t\left(x_t \mid s_0^{t-1}, y_1^{t-1}\right) \triangleq P_{X_t \mid S_0^{t-1}, Y_1^{t-1}}\left(x_t \mid s_0^{t-1}, y_1^{t-1}\right) \tag{25}$$

or in terms of channel states $S_t$ as

$$P_t\left(s_t \mid s_0^{t-1}, y_1^{t-1}\right) \triangleq P_{S_t \mid S_0^{t-1}, Y_1^{t-1}}\left(s_t \mid s_0^{t-1}, y_1^{t-1}\right). \tag{26}$$

For the Gaussian noise channel formulated in the state-space
framework, the feedback capacity for a finite horizon
$n$ equals [7]

$$C^{fb(n)} = \max \frac{1}{n} I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right) = \max \frac{1}{n} \left[h\left(Y_1^n \mid S_0 = s_0\right) - h\left(W_1^n\right)\right], \tag{27}$$

where the maximization is over the channel input distribution (26)
and is subject to the average input power constraint (22).
In the subsequent sections, we focus on characterizing
the optimal signaling and the feedback capacity for
the state-space channel model depicted in Figure 6.
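
The equivalence between the state-space model (19)-(20) and the difference equation (16) can be checked numerically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, all signals are taken to be zero before $t = 1$ (so that $s_0 = 0$), and the example coefficients are arbitrary stable values.

```python
import numpy as np

def state_space_matrices(a):
    """Build A and b: first row of A holds a_1..a_L, the rest is a shift register."""
    a = np.asarray(a, dtype=float)
    L = len(a)
    A = np.zeros((L, L))
    A[0, :] = a
    A[1:, :-1] = np.eye(L - 1)
    b = np.zeros(L)
    b[0] = 1.0
    return A, b

def run_state_space(a, c, x, w):
    """Outputs of S_t = A S_{t-1} + b X_t, Y_t = (a + c)^T S_{t-1} + X_t + W_t with S_0 = 0."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    A, b = state_space_matrices(a)
    s = np.zeros(len(a))
    y = np.empty(len(x))
    for t in range(len(x)):
        y[t] = (a + c) @ s + x[t] + w[t]   # uses the previous state S_{t-1}
        s = A @ s + b * x[t]               # then updates the state to S_t
    return y

def run_difference_equation(a, c, x, w):
    """Outputs of the whitened-channel difference equation with zero initial conditions."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    L, n = len(a), len(x)
    xp = np.concatenate([np.zeros(L), x])
    wp = np.concatenate([np.zeros(L), w])
    yp = np.zeros(n + L)
    for t in range(L, n + L):
        yp[t] = (a @ yp[t - L:t][::-1] + xp[t] + c @ xp[t - L:t][::-1]
                 + wp[t] - a @ wp[t - L:t][::-1])
    return yp[L:]

rng = np.random.default_rng(1)
a_example, c_example = [0.5, -0.2], [0.3, 0.1]   # arbitrary stable second-order example
x, w = rng.normal(size=40), rng.normal(size=40)
y_ss = run_state_space(a_example, c_example, x, w)
y_de = run_difference_equation(a_example, c_example, x, w)
print(np.max(np.abs(y_ss - y_de)))
```

Both recursions produce the same output sequence up to floating-point error, which illustrates why the state sequence can stand in for the input sequence as the source.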



III. n-Block Feedback Capacity

In this section, we manipulate the mutual information
$I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right)$ into a form that is suitable for
evaluation and analysis. Let the source distribution induced
by the encoder $X_t = X_t(\mathcal{M}, Y_1^{t-1})$, i.e., the set of all
valid conditional probability density functions defined in (25)
or (26), be denoted by

$$\mathcal{P} \triangleq \left\{P_t\left(s_t \mid s_0^{t-1}, y_1^{t-1}\right),\ t = 1, 2, \ldots\right\}. \tag{28}$$

The n-block feedback capacity computation problem (13)
becomes (see [7])

$$C^{fb(n)} = \max_{\mathcal{P}} \frac{1}{n} I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right) = \max_{\mathcal{P}} \frac{1}{n} \left[h\left(Y_1^n \mid S_0 = s_0\right) - h\left(W_1^n\right)\right], \tag{29}$$

where the maximization is over the source distribution $\mathcal{P}$.
Note that in (28) we temporarily ignore the linear structure
of the capacity-achieving signaling (14) derived by Cover and
Pombra in [7], and consider the distribution $\mathcal{P}$ instead because
we first want to show that a Markov structure of the
distribution $\mathcal{P}$ is sufficient. Consequently, we will show that a
linear signaling scheme, taking a very simple form, is sufficient
for achieving the n-block capacity.
The differential entropy of the channel noise in (29) can be
alternatively expressed as

$$h\left(W_1^n\right) = \sum_{t=1}^{n} h(W_t) = \sum_{t=1}^{n} h\left(Y_t \mid Y_1^{t-1}, S_1^{t}, S_0 = s_0\right). \tag{30}$$

Here, the equalities follow from the fact that the whitened
noise sequence $W_t$ is independent and identically distributed
(i.i.d.) and from the channel assumptions in Section II-C,
see (24). By substituting (30) into the expression for
$I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right)$ in (29), we arrive at several equivalent
expressions for $I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right)$

$$I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right) = \sum_{t=1}^{n} \left[h\left(Y_t \mid Y_1^{t-1}, s_0\right) - \frac{1}{2} \log\left(2 \pi e \sigma_W^2\right)\right] \tag{31}$$
$$= \sum_{t=1}^{n} \left[h\left(Y_t \mid Y_1^{t-1}, s_0\right) - h\left(Y_t \mid Y_1^{t-1}, S_1^{t}, s_0\right)\right] \tag{32}$$
$$= \sum_{t=1}^{n} \left[h\left(Y_t \mid Y_1^{t-1}, s_0\right) - h\left(Y_t \mid Y_1^{t-1}, S_{t-1}^{t}, s_0\right)\right] \tag{33}$$
$$= \sum_{t=1}^{n} I\left(S_{t-1}^{t}; Y_t \mid Y_1^{t-1}, s_0\right), \tag{34}$$

where equalities in (31) and (32) follow from the definition
of mutual information; the equality in (33) comes from the
chain rule for mutual information and the channel assumptions,
see (24); and equality (34) follows from the definition of
mutual information. The terms in the sums (31) through (34)
represent the amount of information that every channel use
contributes to the total transmitted information.
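Now, the feedback capacity can be expressed as

$$C^{fb(n)} = \max_{\mathcal{P}} \frac{1}{n} I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right) = \max_{\mathcal{P}} \frac{1}{n} \sum_{t=1}^{n} I\left(S_{t-1}^{t}; Y_t \mid Y_1^{t-1}, s_0\right), \tag{35}$$

where the maximization is taken over the set $\mathcal{P}$ of valid
feedback-dependent source distribution functions (28), see [7]
or [13] for the proof.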

A source distribution $\mathcal{P}$ is called optimal if it maximizes
$I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right)$ and thus achieves the n-block
feedback capacity in (35). Since the information rate
$\frac{1}{n} I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right)$ differs from the entropy
rate of the channel output only by a constant, see (31), a
feedback-dependent source is optimal if and only if it maximizes the entropy
rate of the channel output process $Y_t$. For a linear Gaussian
noise channel, the feedback capacity is achieved by a
Gaussian source distribution, see [7], [13]. Therefore, in the
sequel, without loss of optimality, we only consider
feedback-dependent Gaussian source distributions.

Since the number of arguments for the probability density
function (pdf) $P_t\left(s_t \mid s_0^{t-1}, y_1^{t-1}\right)$ increases linearly with time
$t$, it is hard to directly find the optimal distribution $\mathcal{P}$ in (28)
and thereby compute the n-block feedback capacity $C^{fb(n)}$
for large block length $n$. However, by working on a state-space
channel model as defined in Section II-C, we are able
to significantly simplify the problem and derive a simple
dynamic programming method to compute the n-block feedback
capacity.

Footnote 1: The right-hand side of equation (32) is defined by Marko [20] and
Massey [21] as the directed information from the channel state to the channel
output.

IV. n-Block Feedback-Capacity-Achieving
Strategy

A. Gauss-Markov Sources Achieve the Feedback Capacity

In the following analysis, since the initial
channel state $s_0$ is known according to the channel assumption
in Section II-C, for notational simplicity we will not explicitly
write the dependence on $s_0$ when it is obvious.

Theorem 1: [Feedback-dependent Gauss-Markov sources
achieve the feedback capacity] For the power constrained
Gaussian channel, a feedback-dependent Gauss-Markov source
distribution (not necessarily stationary) of the following form

$$\mathcal{P}^{GM} \triangleq \left\{P_t\left(s_t \mid s_{t-1}, y_1^{t-1}\right),\ t = 1, 2, \ldots\right\} \tag{36}$$

achieves the n-block feedback capacity $C^{fb(n)}$, for any block
length $n$.

Proof: We adopt the following proof strategy. As shown
in [7], we only need to consider feedback-dependent Gaussian
source distributions. We take an arbitrary
feedback-dependent Gaussian (not necessarily Markov) source
distribution $\mathcal{P}_1$ of the form in (28). From the source $\mathcal{P}_1$, we
obtain the marginal state transition probabilities, from which
we construct a feedback-dependent Gauss-Markov source $\mathcal{P}_2$
of the form in (36). We then show that sources $\mathcal{P}_1$ and
$\mathcal{P}_2$, though different in general, induce the same information
$I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right)$. Using this argument, we will show that
for any optimal feedback-dependent Gaussian source distribution
that achieves the n-block feedback capacity, there exists
a feedback-dependent Gauss-Markov source distribution (not
necessarily stationary) that also achieves the n-block feedback
capacity.

Let $\mathcal{P}_1$ be any valid feedback-dependent Gaussian source
distribution defined as

$$\mathcal{P}_1 \triangleq \left\{P_t\left(s_t \mid s_0^{t-1}, y_1^{t-1}\right),\ t = 1, 2, \ldots\right\}. \tag{37}$$

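From the source $\mathcal{P}_1$, we define the sequence of conditional
marginal pdf's $Q_t\left(s_t \mid s_{t-1}, y_1^{t-1}\right)$, for $t = 1, 2, \ldots$, as follows

$$Q_t\left(s_t \mid s_{t-1}, y_1^{t-1}\right) \triangleq P^{(\mathcal{P}_1)}_{S_t \mid S_{t-1}, Y_1^{t-1}, S_0}\left(s_t \mid s_{t-1}, y_1^{t-1}, s_0\right) \tag{38}$$
$$= \frac{\int \prod_{\tau=1}^{t} P_\tau\left(s_\tau \mid s_0^{\tau-1}, y_1^{\tau-1}\right) \prod_{\tau=1}^{t-1} P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-2}}{\int \prod_{\tau=1}^{t-1} P_\tau\left(s_\tau \mid s_0^{\tau-1}, y_1^{\tau-1}\right) P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-2}}. \tag{39}$$

Using the functions $Q_t\left(s_t \mid s_{t-1}, y_1^{t-1}\right)$, we construct a
feedback-dependent Markov (not necessarily stationary)
source distribution $\mathcal{P}_2$ as

$$\mathcal{P}_2 \triangleq \left\{Q_t\left(s_t \mid s_{t-1}, y_1^{t-1}\right),\ t = 1, 2, \ldots\right\}. \tag{40}$$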
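The source $\mathcal{P}_2$ induces the following joint pdf of the channel
states $S_{t-1}^{t}$ and outputs $Y_1^{t}$

$$P^{(\mathcal{P}_2)}_{S_{t-1}^{t}, Y_1^{t} \mid S_0}\left(s_{t-1}^{t}, y_1^{t} \mid s_0\right) = \int P^{(\mathcal{P}_2)}_{S_1^{t}, Y_1^{t} \mid S_0}\left(s_1^{t}, y_1^{t} \mid s_0\right) ds_1^{t-2} \tag{41}$$
$$= \int \prod_{\tau=1}^{t} Q_\tau\left(s_\tau \mid s_{\tau-1}, y_1^{\tau-1}\right) P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-2}. \tag{42}$$

We next show by induction that the joint distribution (42)
of $S_{t-1}^{t}$ and $Y_1^{t}$ induced by the source $\mathcal{P}_2$ is the same as the
one induced by the source $\mathcal{P}_1$, i.e.,

$$P^{(\mathcal{P}_2)}_{S_{t-1}^{t}, Y_1^{t} \mid S_0}\left(s_{t-1}^{t}, y_1^{t} \mid s_0\right) = P^{(\mathcal{P}_1)}_{S_{t-1}^{t}, Y_1^{t} \mid S_0}\left(s_{t-1}^{t}, y_1^{t} \mid s_0\right) \tag{43}$$
$$= \int P^{(\mathcal{P}_1)}_{S_1^{t}, Y_1^{t} \mid S_0}\left(s_1^{t}, y_1^{t} \mid s_0\right) ds_1^{t-2} \tag{44}$$
$$= \int \prod_{\tau=1}^{t} P_\tau\left(s_\tau \mid s_0^{\tau-1}, y_1^{\tau-1}\right) P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-2}. \tag{45}$$

For $t = 1$, by the definition (39) of source $\mathcal{P}_2$ we have

$$Q_1\left(s_1 \mid s_0\right) = P_1\left(s_1 \mid s_0\right). \tag{46}$$

We verify the to-be-proved equality (43) for $t = 1$ by noting
that

$$P^{(\mathcal{P}_2)}_{S_1, Y_1 \mid S_0}\left(s_1, y_1 \mid s_0\right) = Q_1\left(s_1 \mid s_0\right) P_{Y_1 \mid S_0^{1}}\left(y_1 \mid s_0^{1}\right) \tag{47}$$
$$= P_1\left(s_1 \mid s_0\right) P_{Y_1 \mid S_0^{1}}\left(y_1 \mid s_0^{1}\right) \tag{48}$$
$$= P^{(\mathcal{P}_1)}_{S_1, Y_1 \mid S_0}\left(s_1, y_1 \mid s_0\right), \tag{49}$$

which directly implies

$$P^{(\mathcal{P}_2)}_{S_0^{1}, Y_1 \mid S_0}\left(s_0^{1}, y_1 \mid s_0\right) = P^{(\mathcal{P}_1)}_{S_0^{1}, Y_1 \mid S_0}\left(s_0^{1}, y_1 \mid s_0\right). \tag{50}$$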

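Now, assume that the equality (43) holds for up to time $t - 1$,
where $t > 1$, particularly,

$$P^{(\mathcal{P}_2)}_{S_{t-2}^{t-1}, Y_1^{t-1} \mid S_0}\left(s_{t-2}^{t-1}, y_1^{t-1} \mid s_0\right) = P^{(\mathcal{P}_1)}_{S_{t-2}^{t-1}, Y_1^{t-1} \mid S_0}\left(s_{t-2}^{t-1}, y_1^{t-1} \mid s_0\right) \tag{51}$$
$$= \int \prod_{\tau=1}^{t-1} P_\tau\left(s_\tau \mid s_0^{\tau-1}, y_1^{\tau-1}\right) P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-3}. \tag{52}$$

The induction step for time $t$ is simply shown as follows

$$P^{(\mathcal{P}_2)}_{S_{t-1}^{t}, Y_1^{t} \mid S_0}\left(s_{t-1}^{t}, y_1^{t} \mid s_0\right) = \int Q_t\left(s_t \mid s_{t-1}, y_1^{t-1}\right) P_{Y_t \mid S_{t-1}^{t}}\left(y_t \mid s_{t-1}^{t}\right) P^{(\mathcal{P}_2)}_{S_{t-2}^{t-1}, Y_1^{t-1} \mid S_0}\left(s_{t-2}^{t-1}, y_1^{t-1} \mid s_0\right) ds_{t-2} \tag{53}$$
$$\overset{(a)}{=} \frac{\int \prod_{\tau=1}^{t} P_\tau\left(s_\tau \mid s_0^{\tau-1}, y_1^{\tau-1}\right) \prod_{\tau=1}^{t-1} P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-2}}{\int \prod_{\tau=1}^{t-1} P_\tau\left(s_\tau \mid s_0^{\tau-1}, y_1^{\tau-1}\right) P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-2}}\; P_{Y_t \mid S_{t-1}^{t}}\left(y_t \mid s_{t-1}^{t}\right) \int \prod_{\tau=1}^{t-1} P_\tau\left(s_\tau \mid s_0^{\tau-1}, y_1^{\tau-1}\right) P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-2} \tag{54}$$
$$\overset{(b)}{=} \int \prod_{\tau=1}^{t} P_\tau\left(s_\tau \mid s_0^{\tau-1}, y_1^{\tau-1}\right) P_{Y_\tau \mid S_{\tau-1}^{\tau}}\left(y_\tau \mid s_{\tau-1}^{\tau}\right) ds_1^{t-2} \tag{55}$$
$$= P^{(\mathcal{P}_1)}_{S_{t-1}^{t}, Y_1^{t} \mid S_0}\left(s_{t-1}^{t}, y_1^{t} \mid s_0\right), \tag{56}$$

where (a) is the result of substituting the definition (39) for
source $\mathcal{P}_2$ and the induction assumption (52) into (53), and (b)
is obtained by simplifying the expression in (54).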



Thus, we have shown that the channel states $S_{t-1}^{t}$ and
outputs $Y_1^{t}$ induced by sources $\mathcal{P}_1$ and $\mathcal{P}_2$ have the same
distribution. It is therefore clear that the sources $\mathcal{P}_1$ and $\mathcal{P}_2$
induce the same information $I\left(\mathcal{M}; Y_1^n \mid S_0 = s_0\right)$ according
to (31).

Note that the set of channel state vector entries
$\{S_\tau(i) : 1 \le i \le L,\ 1 \le \tau \le t - 1\}$ and the input sequence
$X_1^{t-1}$ linearly determine each other according to the channel
law in (19). Thus, induced by the Gaussian source $\mathcal{P}_1$, the
channel state entries $S_\tau(i)$ (for $1 \le i \le L$ and $1 \le \tau \le t$)
and the symbols $Y_1^{t-1}$ are jointly Gaussian. We conclude that
the conditional pdf's $Q_t\left(s_t \mid s_{t-1}, y_1^{t-1}\right)$ constructed in (39)
are also Gaussian functions and thus $\mathcal{P}_2$ is a
feedback-dependent Gauss-Markov (not necessarily stationary) source.
The Gauss-Markov source $\mathcal{P}_2$ also satisfies the input power
constraint (22), which is obvious by the equality in (56). ■

Theorem 1 reveals that, for any given prior channel outputs
$y_1^{t-1}$, it is sufficient to utilize Markov sources to maximize
the entropy of the channel output sequence. By Theorem 1,
without loss of optimality, in the sequel we only consider
feedback-dependent Gauss-Markov sources of the following
form

$$\mathcal{P}^{GM} \triangleq \left\{P_t\left(s_t \mid s_{t-1}, y_1^{t-1}\right),\ t = 1, 2, \ldots\right\}. \tag{57}$$

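B. The Kalman-Bucy Filter is Optimal for Processing the
Feedback

Definition 1: We use $\alpha_t(\cdot)$ as shorthand notation for the
posterior pdf of the channel state $S_t$ given the prior channel
outputs $Y_1^{t} = y_1^{t}$, that is

$$\alpha_t(\mu) \triangleq P_{S_t \mid S_0, Y_1^{t}}\left(\mu \mid s_0, y_1^{t}\right). \tag{58}$$

□

For a feedback-dependent Gauss-Markov source $\mathcal{P}^{GM}$, the
functions $\alpha_t(\cdot)$ can be recursively computed by the Bayes rule
as follows

$$\alpha_t(\mu) = \frac{\int \alpha_{t-1}(v)\, P_t\left(\mu \mid v, y_1^{t-1}\right) P_{Y_t \mid S_{t-1}, S_t}\left(y_t \mid v, \mu\right) dv}{\iint \alpha_{t-1}(v)\, P_t\left(u \mid v, y_1^{t-1}\right) P_{Y_t \mid S_{t-1}, S_t}\left(y_t \mid v, u\right) du\, dv}. \tag{59}$$

Since the function $\alpha_t(\cdot)$ is a Gaussian pdf, it is completely
characterized by the conditional mean $m_t$ and conditional
covariance matrix $K_t$ of the channel state

$$m_t \triangleq E\left[S_t \mid s_0, y_1^{t}\right], \tag{60}$$
$$K_t \triangleq E\left[(S_t - m_t)(S_t - m_t)^T \mid s_0, y_1^{t}\right]. \tag{61}$$

We note that the recursion (59), i.e., the recursive computation
of $m_t$ and $K_t$, can be implemented by a Kalman-Bucy
filter [22].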

Theorem 2: [Kalman-Bucy filter is optimal for processing
the feedback] For the power-constrained linear Gaussian channel,
let the feedback-dependent Gauss-Markov (not necessarily
stationary) source $\mathcal{P}^{\rm GM}_{\alpha}$ be defined as

$$\mathcal{P}^{\rm GM}_{\alpha} \triangleq \left\{ P_t\left(\underline{s}_t \,\middle|\, \underline{s}_{t-1}, \alpha_{t-1}(\cdot)\right),\ t = 1, 2, \ldots \right\}, \tag{62}$$

where the Markov transition probability depends only on the
posterior distribution function of the channel state $\alpha_{t-1}(\cdot)$ instead
of all prior channel outputs $Y_1^{t-1}$. The $n$-block feedback
capacity $C^{{\rm fb}(n)}$ then equals

$$C^{{\rm fb}(n)} = \max_{\mathcal{P}^{\rm GM}_{\alpha}} \frac{1}{n} I\left(\mathcal{M}; Y_1^n \,\middle|\, \underline{S}_0 = \underline{s}_0\right) = \max_{\mathcal{P}^{\rm GM}_{\alpha}} \frac{1}{n} \sum_{t=1}^{n} I\left(\underline{S}_{t-1}^{t}; Y_t \,\middle|\, Y_1^{t-1}, \underline{s}_0\right), \tag{63}$$

where the maximization is taken subject to an average input
power constraint

$$\frac{1}{n} \sum_{t=1}^{n} E\left[(X_t)^2 \,\middle|\, \underline{S}_0 = \underline{s}_0\right] \le P. \tag{64}$$

This capacity-achieving strategy is depicted in Figure 7. □

Fig. 7. The optimal feedback strategy for the linear Gaussian noise channel.

Proof: We first briefly outline the proof strategy. We will
consider two different feedback vector realizations (channel
output histories) $y_1^{t-1}$ and $\tilde{y}_1^{t-1}$ that both induce the same
posterior state pdf $\alpha_{t-1}(\cdot)$ at time $t-1$. We will then
apply equivalent Gauss-Markov source distributions for the
subsequent source symbols (at times $t, t+1, \ldots$), irrespective of
the feedback realization ($y_1^{t-1}$ or $\tilde{y}_1^{t-1}$), and we get the same
distributions for the subsequent channel inputs and outputs,
and thus the same transmission power and output entropy (for
the subsequent transmissions), irrespective of the two channel
histories. Using this result, we complete the proof using an
inductive argument.

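Suppose that two different feedback vectors $y_1^{t-1}$ and $\tilde{y}_1^{t-1}$
($y_1^{t-1} \ne \tilde{y}_1^{t-1}$) induce the same posterior channel state pdf
$\alpha_{t-1}(\cdot)$, that is, for any possible state value $\underline{s}_{t-1} = \underline{\mu}$ we
have

$$\alpha_{t-1}(\underline{\mu}) = P_{\underline{S}_{t-1} | \underline{S}_0, Y_1^{t-1}}\left(\underline{\mu} \,\middle|\, \underline{s}_0, y_1^{t-1}\right) = P_{\underline{S}_{t-1} | \underline{S}_0, Y_1^{t-1}}\left(\underline{\mu} \,\middle|\, \underline{s}_0, \tilde{y}_1^{t-1}\right) = \tilde{\alpha}_{t-1}(\underline{\mu}). \tag{65}$$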



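Now consider two distributions of the source $\underline{S}_\tau$, for $\tau \ge t$,
the first distribution conditioned on $y_1^{t-1}$, and the second
conditioned on $\tilde{y}_1^{t-1}$. If we let these two distributions, irrespective
of the feedback realization ($y_1^{t-1}$ or $\tilde{y}_1^{t-1}$), be equal to each
other for $\tau \ge t$, that is, if for $\tau \ge t$

$$P_\tau\left(\underline{s}_\tau \,\middle|\, \underline{s}_{\tau-1}, y_1^{t-1}, y_t^{\tau-1}\right) = P_\tau\left(\underline{s}_\tau \,\middle|\, \underline{s}_{\tau-1}, \tilde{y}_1^{t-1}, y_t^{\tau-1}\right), \tag{66}$$

we then have

$$P_{\underline{S}_{t-1}^{n}, Y_t^n | \underline{S}_0, Y_1^{t-1}}\left(\underline{s}_{t-1}^{n}, y_t^n \,\middle|\, \underline{s}_0, y_1^{t-1}\right) = \alpha_{t-1}(\underline{s}_{t-1}) \prod_{\tau=t}^{n} P_\tau\left(\underline{s}_\tau \,\middle|\, \underline{s}_{\tau-1}, y_1^{\tau-1}\right) P_{Y_\tau | \underline{S}_{\tau-1}, \underline{S}_\tau}\left(y_\tau \,\middle|\, \underline{s}_{\tau-1}, \underline{s}_\tau\right) = P_{\underline{S}_{t-1}^{n}, Y_t^n | \underline{S}_0, Y_1^{t-1}}\left(\underline{s}_{t-1}^{n}, y_t^n \,\middle|\, \underline{s}_0, \tilde{y}_1^{t-1}\right). \tag{67}$$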

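The equality in (67) directly implies that the entropies are
equal

$$h\left(Y_t^n \,\middle|\, \underline{s}_0, y_1^{t-1}\right) = h\left(Y_t^n \,\middle|\, \underline{s}_0, \tilde{y}_1^{t-1}\right), \tag{68}$$

and that for any $\tau \ge t$ the transmission powers are equal

$$E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0, \tilde{y}_1^{t-1}\right]. \tag{69}$$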

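Therefore, for any $t > 0$ and any constant $\Pi \ge 0$, the source
distribution $P_\tau\left(\underline{s}_\tau \,|\, \underline{s}_{\tau-1}, y_1^{t-1}, y_t^{\tau-1}\right)$ for time $t \le \tau \le n$
that maximizes the channel output entropy $h\left(Y_t^n \,|\, \underline{s}_0, y_1^{t-1}\right)$,
subject to power constraint

$$\sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right] \le \Pi, \tag{70}$$

when $y_1^{t-1}$ is the feedback vector, must also maximize the
entropy $h\left(Y_t^n \,|\, \underline{s}_0, \tilde{y}_1^{t-1}\right)$ under constraint

$$\sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0, \tilde{y}_1^{t-1}\right] \le \Pi, \tag{71}$$

when $\tilde{y}_1^{t-1}$ is the feedback vector. We note that $\Pi$ is an
arbitrary non-negative constant (independent of the channel
output history, and independent of the initial state).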

According to the above analysis, for any given power
budget $\Pi \ge 0$ for transmissions at times $\tau \ge t$, the optimal
source distribution $P_t\left(\underline{s}_t \,|\, \underline{s}_{t-1}, y_1^{t-1}\right)$ at time $t$ that
maximizes the entropy of subsequent channel outputs depends only
on $\alpha_{t-1}(\cdot)$. We may summarize the conclusion as follows:

Conclusion 1: At any time instant $t$, for any constant
non-negative power constraint $\Pi$, the optimal
distribution of the source $X_t^n$ that maximizes the
channel output entropy $h\left(Y_t^n \,|\, \underline{s}_0, y_1^{t-1}\right)$ under the
power constraint $\sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,|\, \underline{s}_0, y_1^{t-1}\right] \le \Pi$,
depends only on the posterior channel state distribution
$\alpha_{t-1}(\cdot)$, regardless of the channel output history
$y_1^{t-1}$ that led to that posterior state distribution.

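We now show that we can drop the conditioning on the
channel output history $y_1^{t-1}$ when formulating the power
constraint in Conclusion 1. According to (29), the maximization
problem in (63) is equivalent to maximizing the channel output
entropy, i.e.,

$$\arg\max_{\mathcal{P}^{\rm GM}} \sum_{t=1}^{n} I\left(\underline{S}_{t-1}^{t}; Y_t \,\middle|\, Y_1^{t-1}, \underline{s}_0\right) = \arg\max_{\mathcal{P}^{\rm GM}} h\left(Y_1^n \,\middle|\, \underline{s}_0\right). \tag{72}$$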



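For any $1 \le t \le n$, we can rewrite the channel output entropy
as

$$h\left(Y_1^n \,\middle|\, \underline{s}_0\right) = h\left(Y_1^{t-1} \,\middle|\, \underline{s}_0\right) + \int h\left(Y_t^n \,\middle|\, \underline{s}_0, y_1^{t-1}\right) P_{Y_1^{t-1}}\left(y_1^{t-1}\right) dy_1^{t-1}. \tag{73}$$

Similarly, we can rewrite the input power as

$$\sum_{\tau=1}^{n} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0\right] = \sum_{\tau=1}^{t-1} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0\right] + \int \sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right] P_{Y_1^{t-1}}\left(y_1^{t-1}\right) dy_1^{t-1}. \tag{74}$$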



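For the optimal source distribution that maximizes (72) subject
to the power constraint (64), and for any given channel output
realization $y_1^{t-1}$, we define the remaining power at time $t$ as

$$\Pi_{t-1}\left(y_1^{t-1}\right) \triangleq \sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right]. \tag{75}$$

Clearly $\Pi_0 = nP$. By definition, $\Pi_{t-1}\left(y_1^{t-1}\right)$ is a function
of prior channel outputs and a function of time $t$. We note
that the channel inputs and outputs are jointly Gaussian [7].
Thus for any $\tau \ge t$, it is sufficient to only consider optimal
Gaussian sources that satisfy

$$E\left[X_\tau \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = 0, \tag{76}$$

because otherwise one could easily verify that another
Gaussian source defined as $\tilde{X}_\tau = X_\tau - E\left[X_\tau \,|\, \underline{s}_0, y_1^{t-1}\right]$
for $\tau \ge t$ would induce the same channel output entropy (73)
while consuming strictly less power than $X_\tau$. Now, since
$E\left[X_\tau \,|\, \underline{s}_0, y_1^{t-1}\right] = 0$, we have

$$\sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = \sum_{\tau=t}^{n} {\rm Var}\left(X_\tau \,\middle|\, \underline{s}_0, y_1^{t-1}\right). \tag{77}$$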



But since $X_\tau$ and $Y_1^{t-1}$ are jointly Gaussian, the variance
${\rm Var}\left(X_\tau \,|\, \underline{s}_0, y_1^{t-1}\right)$ does not depend on the realization $y_1^{t-1}$,
and we can write

$$\sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = \sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,\middle|\, \underline{s}_0\right], \tag{78}$$

or equivalently, $\Pi_{t-1} = \Pi_{t-1}\left(y_1^{t-1}\right)$. Hence, conditioning on
the channel output history $y_1^{t-1}$ is irrelevant when formulating
the power constraint. Directly utilizing (78), we can modify
Conclusion 1 into a new (slightly relaxed) conclusion that does
not require conditioning on the channel output history when
formulating the power constraint.

Conclusion 2: At any time instant $t$, for any constant
non-negative power constraint $\Pi$, the optimal
distribution of the source $X_t^n$ that maximizes
the channel output entropy $h\left(Y_t^n \,|\, \underline{s}_0, y_1^{t-1}\right)$ under
the power constraint $\sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,|\, \underline{s}_0, y_1^{t-1}\right] = \sum_{\tau=t}^{n} E\left[(X_\tau)^2 \,|\, \underline{s}_0\right] \le \Pi$,
depends only on the posterior channel state distribution
$\alpha_{t-1}(\cdot)$, regardless of the channel output history
$y_1^{t-1}$ that led to that posterior state distribution.

We now utilize Conclusion 2 to finalize the proof using an
inductive argument. Let $\Pi_0 = nP$ be the power constraint at
the beginning of transmissions. We have already established
in Conclusion 2 that given the power constraint $\Pi_0$, the
optimal source that maximizes $h\left(Y_1^n \,|\, \underline{s}_0\right)$ under the power
constraint $\Pi_0$ depends only on $\alpha_0(\cdot)$.$^{2}$ The optimal source
now generates the channel input $X_1$ at time $t = 1$, whose
power is $\pi_1 = E\left[(X_1)^2 \,|\, \underline{s}_0\right]$. The channel input $X_1$ induces
the channel output $Y_1$, which in turn produces a new posterior
state distribution $\alpha_1(\cdot)$. The leftover power is $\Pi_1 = \Pi_0 - \pi_1$.
Now, the optimal source at time instant $t = 2$ is one that
maximizes $h\left(Y_2^n \,|\, \underline{s}_0, y_1\right)$ subject to the power constraint $\Pi_1$
and, according to Conclusion 2, the source distribution at time
$t = 2$ depends only on $\alpha_1(\cdot)$. This source generates $X_2$.
Extending the inductive argument further in similar fashion, we
conclude that at each time step $t$, according to Conclusion 2,
the source distribution at time $t$ depends only on the posterior
state distribution $\alpha_{t-1}(\cdot)$. ■

An alternative proof based on dynamic programming [23]
and Lagrange multipliers is given in Appendix 1.

Theorem 2 suggests that, for the task of constructing the
next signal to be transmitted, all the "knowledge" contained
in the vector of prior channel outputs $y_1^{t-1}$ is captured by the
posterior pdf of the channel state $\alpha_{t-1}(\cdot)$.

So far, we have simplified the capacity-achieving feedback
strategy such that the transmitter does not need to memorize
the entire channel dynamics $y_1^{t-1}$. Instead, for forming the
optimal signal $X_t$ (or equivalently state $\underline{S}_t$) to be transmitted
at time $t$, we only need to know the immediately preceding
channel state $\underline{S}_{t-1}$ (determined by the prior channel
inputs) and the Kalman-Bucy filter output, which is a
Gaussian probability density function $\alpha_{t-1}(\cdot)$ characterized
by the conditional mean $\underline{m}_{t-1}$ and the conditional covariance
matrix $\mathbf{K}_{t-1}$.




C. Properties of Capacity-Achieving Channel Dynamics 

For any feedback-dependent Gauss-Markov source $\mathcal{P}^{\rm GM}_{\alpha}$,
we have the following properties for the channel states, outputs
and posterior state distributions.

Theorem 3: For the linear Gaussian noise channel, if the
feedback-dependent Gauss-Markov source distribution $\mathcal{P}^{\rm GM}_{\alpha}$
is used, we have

1) The sequence of posterior channel state distributions
$\alpha_t(\cdot)$, or the pairs of variables $\underline{m}_t$ and $\mathbf{K}_t$, is a Markov
random process.

2) The sequence of the pairs $(\alpha_t(\cdot), \underline{S}_t)$ is a Markov
random process.

3) The sequence of the pairs $(\alpha_t(\cdot), Y_t)$ is a Markov
random process.

4) The sequence of the triples $(\alpha_t(\cdot), \underline{S}_t, Y_t)$ is a Markov
random process. □

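Proof: For the Gauss-Markov source $\mathcal{P}^{\rm GM}_{\alpha}$, the recursive
sum-product rule (59) for updating the posterior pdf of the
channel state becomes

$$\alpha_t(\underline{\mu}) = \frac{\int \alpha_{t-1}(\underline{v})\, P_t\left(\underline{\mu} \,\middle|\, \underline{v}, \alpha_{t-1}(\cdot)\right) P_{Y_t | \underline{S}_{t-1}, \underline{S}_t}\left(y_t \,\middle|\, \underline{v}, \underline{\mu}\right) d\underline{v}}{\iint \alpha_{t-1}(\underline{v})\, P_t\left(\underline{u} \,\middle|\, \underline{v}, \alpha_{t-1}(\cdot)\right) P_{Y_t | \underline{S}_{t-1}, \underline{S}_t}\left(y_t \,\middle|\, \underline{v}, \underline{u}\right) d\underline{u}\, d\underline{v}}, \tag{79}$$

and the conditional pdf of channel output $y_t$ is

$$P_{Y_t | \alpha_{t-1}(\cdot)}\left(y_t \,\middle|\, \alpha_{t-1}(\cdot)\right) = P_{Y_t | \underline{S}_0, Y_1^{t-1}}\left(y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right) = \iint \alpha_{t-1}(\underline{v})\, P_t\left(\underline{u} \,\middle|\, \underline{v}, \alpha_{t-1}(\cdot)\right) P_{Y_t | \underline{S}_{t-1}, \underline{S}_t}\left(y_t \,\middle|\, \underline{v}, \underline{u}\right) d\underline{u}\, d\underline{v}. \tag{80}$$

We note that in (79) and (80), the dependence on $\underline{s}_0$ and $y_1^{t-1}$
is replaced by the dependence on $\alpha_{t-1}(\cdot)$ according to Theorem 2.

The theorem follows from (79), (80) and the Markovianity
of the source $\mathcal{P}^{\rm GM}_{\alpha}$. ■

$^{2}$ Note that here $\alpha_0(\underline{\mu}) = \delta\left(\underline{\mu} - \underline{s}_0\right)$ because $\underline{s}_0$ is the known initial state.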



D. Feedback-Capacity-Achieving Sources for General State-Machine
Channels

Theorems 1, 2 and 3 also hold for more general state-machine
(or state-space) channels other than the linear Gaussian
noise channel, except that for general state-machine
channels, the feedback-capacity-achieving source may not be
Gaussian any more, and that the Kalman-Bucy filter is replaced
by a discrete-time version of the Wonham filter [24] (or
the forward sum-product recursion of the BCJR or Baum-Welch
algorithm [25], [26], [27], [28] for finite-state machines,
see [29]).

The linear Gaussian noise channel in this paper has an 
average input power constraint over the whole block (or 
horizon), which makes the feedback capacity computation 
problem even more difficult than the state-machine channel 
considered in [29], where no average power constraint is 
needed and the inputs are chosen from a finite-size alphabet. 
We note that the arguments used to prove Theorems 1, 2 and 3
also hold for some other channel input power constraints, 
e.g., the peak input power constraint, which is beyond the 
scope of this paper and will not be elaborated on any further. 

V. n-Block Feedback Capacity Computation

A. Parameterizing the Feedback-Capacity-Achieving Markov 
Sources 

Lemma 1: Without loss of generality, the Gauss-Markov
source $\mathcal{P}^{\rm GM}_{\alpha}$ can be expressed as

$$X_t = \underline{d}_t^{T} \underline{S}_{t-1} + e_t Z_t + g_t, \tag{81}$$

where $Z_t$ is a white Gaussian random process with unit
variance and is independent of $X_1^{t-1}$ and $Y_1^{t-1}$. (Here $e_t$ and
$g_t$ are scalars, and $\underline{d}_t$ is a vector of length $L$.) The coefficients
$\underline{d}_t$, $e_t$ and $g_t$ are all dependent on the Gaussian pdf $\alpha_{t-1}(\cdot)$, or
alternatively on its mean $\underline{m}_{t-1}$ and covariance matrix $\mathbf{K}_{t-1}$.
The set of coefficients $\{\underline{d}_t, e_t, g_t\}$ completely determines the
feedback-dependent Gauss-Markov source distribution $\mathcal{P}^{\rm GM}_{\alpha}$
needed in Theorem 2. □
Proof: Given the channel state realization $\underline{S}_{t-1} = \underline{s}_{t-1}$,
the input $x_t$ and state $\underline{s}_t$ determine each other uniquely by the
channel state propagation rule (19), so the feedback-dependent
Gauss-Markov source $\mathcal{P}^{\rm GM}_{\alpha}$ in Theorem 2 can be equivalently
represented as

$$\mathcal{P}^{\rm GM}_{\alpha} = \left\{ P_t\left(x_t \,\middle|\, \underline{s}_{t-1}, \alpha_{t-1}(\cdot)\right),\ t = 1, 2, \ldots \right\}. \tag{82}$$

Further, since the channel state propagation rule (19) is linear,
the joint distribution of $X_t$ and $\underline{S}_{t-1}$ is also Gaussian
for any given $\alpha_{t-1}(\cdot)$. Thus, without loss of generality,
the source $\mathcal{P}^{\rm GM}_{\alpha}$ can be specified as in (81) by coefficients
$\underline{d}_t$, $e_t$ and $g_t$, which depend on the Kalman-Bucy filter
output $\alpha_{t-1}(\cdot)$. ■

The source parametrization in (81) reveals further structure
when compared to the source parametrization (14) obtained
by Cover and Pombra [7]. It is clear that every parametrization
(81) leads to an equivalent parametrization (14), but not
vice versa. Also note that the number of parameters $\underline{d}_t$, $e_t$ and
$g_t$ in (81) grows only linearly with the horizon distance $n$,
while the number of parameters in (14) grows quadratically
with the horizon distance $n$. The encoding complexity in (14)
grows linearly with time in general, while the encoding
complexity in (81) is constant for any time instant $t > 1$.

The following lemma establishes a formula for each term in
the information sum (63) and the input signal power sum (64)
in terms of the source parameters $\underline{d}_t$, $e_t$, and $g_t$.

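Lemma 2: For the feedback-dependent Gauss-Markov
source (81), we have

$$h\left(Y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right) = \frac{1}{2}\log\left(2\pi e \sigma_W^2\right) + \frac{1}{2}\log\frac{\sigma_W^2 + \left(\underline{a}+\underline{c}+\underline{d}_t\right)^{T} \mathbf{K}_{t-1} \left(\underline{a}+\underline{c}+\underline{d}_t\right) + (e_t)^2}{\sigma_W^2} \tag{83}$$

and

$$E\left[(X_t)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = \left(\underline{d}_t^{T}\underline{m}_{t-1} + g_t\right)^2 + \underline{d}_t^{T}\mathbf{K}_{t-1}\underline{d}_t + (e_t)^2, \tag{84}$$

where $\underline{d}_t$, $e_t$ and $g_t$ are themselves functions of $\underline{m}_{t-1}$
and $\mathbf{K}_{t-1}$. □

Proof: For the source (81) in Lemma 1, using the Bayes
rule, we can formulate the first and second order conditional
moments of the channel input $X_t$ and channel output $Y_t$ in
terms of $\underline{d}_t$, $e_t$, $g_t$, $\underline{m}_{t-1}$ and $\mathbf{K}_{t-1}$ as

$$E\left[X_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = \underline{d}_t^{T}\underline{m}_{t-1} + g_t, \tag{85}$$
$$E\left[(X_t)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = \left(\underline{d}_t^{T}\underline{m}_{t-1} + g_t\right)^2 + \underline{d}_t^{T}\mathbf{K}_{t-1}\underline{d}_t + (e_t)^2, \tag{86}$$
$$E\left[Y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = \left(\underline{a}+\underline{d}_t\right)^{T}\underline{m}_{t-1} + g_t, \tag{87}$$
$$E\left[\left(Y_t - E\left[Y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right]\right)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right] = \left(\underline{a}+\underline{c}+\underline{d}_t\right)^{T}\mathbf{K}_{t-1}\left(\underline{a}+\underline{c}+\underline{d}_t\right) + (e_t)^2 + \sigma_W^2. \tag{88}$$

Note that conditioned on $\underline{s}_0$ and $y_1^{t-1}$, the random variable $Y_t$
has a Gaussian distribution with variance as in (88), thus we
obtain (83). Equation (84) is the same as (86). ■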
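The second-moment formula (84) of Lemma 2 can be sanity-checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes that, given the feedback, the previous state is Gaussian with mean `m_prev` and covariance `K_prev` (the Kalman-Bucy filter output), draws the input according to the linear form (81), and compares the sample second moment with the closed form. The function names and the numerical values are hypothetical.

```python
import numpy as np

def input_second_moment(m_prev, K_prev, d, e, g):
    """Closed-form E[X_t^2 | feedback] as in (84)."""
    return (d @ m_prev + g) ** 2 + d @ K_prev @ d + e ** 2

def simulate_input_second_moment(m_prev, K_prev, d, e, g, num=200_000, seed=0):
    """Monte Carlo estimate of E[X_t^2] for X_t = d^T S_{t-1} + e Z_t + g."""
    rng = np.random.default_rng(seed)
    # Previous state as seen by the receiver: S_{t-1} ~ N(m_prev, K_prev).
    s = rng.multivariate_normal(m_prev, K_prev, size=num)
    z = rng.standard_normal(num)          # unit-variance white Gaussian Z_t
    x = s @ d + e * z + g                 # linear signaling form (81)
    return float(np.mean(x ** 2))

# Hypothetical two-dimensional example (L = 2).
m_prev = np.array([0.3, -0.2])
K_prev = np.array([[1.0, 0.2], [0.2, 0.5]])
d = np.array([0.5, -1.0])
exact = input_second_moment(m_prev, K_prev, d, e=0.7, g=0.1)
estimate = simulate_input_second_moment(m_prev, K_prev, d, e=0.7, g=0.1)
```

The two values should agree up to Monte Carlo error; the first term of the closed form is the squared conditional mean, which is exactly the part that carries no information to the receiver.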



B. Problem Reformulation Using Lagrange Multipliers 

The feedback capacity computation problem as stated in 
Theorem 2 is a maximization problem under an inequality
constraint. Since the n-block feedback capacity computation 
problem is a convex optimization problem [19], and since 
Slater's conditions (see [30], Section 5.2.3) for this optimiza- 
tion problem are satisfied, strong duality holds between the 
original optimization problem and its Lagrangian dual. Such 
optimization problems can be reformulated (or solved) by
using the method of Lagrange multipliers. 

We first define a reward function for this optimization 
problem. 

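Definition 2: For an arbitrary constant $\gamma \ge 0$, we define the
reward function $\Omega(\cdot)$ as

$$\Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right) \triangleq \underbrace{h\left(Y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right) - \frac{1}{2}\log\left(2\pi e \sigma_W^2\right)}_{\text{information transmitted}} - \underbrace{\gamma\, E\left[(X_t)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right]}_{\text{penalty for using power}}$$
$$\overset{(a)}{=} \frac{1}{2}\log\frac{\sigma_W^2 + \left(\underline{a}+\underline{c}+\underline{d}_t\right)^{T}\mathbf{K}_{t-1}\left(\underline{a}+\underline{c}+\underline{d}_t\right) + (e_t)^2}{\sigma_W^2} - \gamma\left[\left(\underline{d}_t^{T}\underline{m}_{t-1} + g_t\right)^2 + \underline{d}_t^{T}\mathbf{K}_{t-1}\underline{d}_t + (e_t)^2\right], \tag{89}$$

where equality (a) follows from Lemma 2. □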
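As an illustration of Definition 2, the closed form (89) can be evaluated directly. The sketch below is not taken from the paper: the function name `reward`, the argument order, and the use of natural logarithms (so that the reward is in nats) are assumptions, and the channel vectors `a` and `c` are supplied by the caller.

```python
import numpy as np

def reward(m_prev, K_prev, d, e, g, gamma, a, c, sigma2_w=1.0):
    """Stage reward of (89): information term minus gamma times input power."""
    h = a + c + d                                   # vector appearing in Var(Y_t | feedback)
    var_y = sigma2_w + h @ K_prev @ h + e ** 2      # conditional output variance, cf. (88)
    info = 0.5 * np.log(var_y / sigma2_w)           # information transmitted
    power = (d @ m_prev + g) ** 2 + d @ K_prev @ d + e ** 2   # cf. (84)
    return info - gamma * power

# Hypothetical scalar example (L = 1).
z = np.zeros(1)
m = np.array([1.0])
K = np.array([[0.5]])
d = np.array([0.7])
with_zero_mean = reward(m, K, d, 1.0, -float(d @ m), 0.1, a=z, c=z)
with_offset = reward(m, K, d, 1.0, 0.3, 0.1, a=z, c=z)
```

Only the penalty term depends on `g`, so choosing `g = -d @ m_prev` (a zero conditional input mean) can never reduce the reward; here `with_zero_mean` is at least `with_offset`.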
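Theorem 4: For a linear Gaussian noise channel with noiseless
feedback, the $n$-block feedback capacity under the average
input power constraint equals

$$C^{{\rm fb}(n)} = C^{{\rm fb}(n)}(P) = \max_{\mathcal{P}^{\rm GM}_{\alpha}} \frac{1}{n} E\left[\sum_{t=1}^{n} \Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right)\right] + \gamma P, \tag{90}$$

where $\gamma \ge 0$ and

$$\gamma\left(\sum_{t=1}^{n} E\left[(X_t)^2 \,\middle|\, \underline{S}_0 = \underline{s}_0\right] - nP\right) = 0 \tag{91}$$

are first-order Kuhn-Tucker necessary conditions for achieving
the feedback capacity. □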
Proof: It has already been established that the n-block 
feedback capacity computation is a convex optimization prob- 
lem [19]. It is easy to verify that Slater's conditions (see [30], 
Section 5.2.3) hold, hence the primal and the Lagrangian dual 
problems are equivalent (see Appendix 2-A and Appendix 2- 
C for a complete proof). Thus, we can focus on the dual 
Lagrangian problem to derive the optimal source distribution 
so as to achieve and compute the n-block feedback capacity. 

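We consider the optimization problem in Theorem 2, i.e.,
we consider finding

$$\max_{\mathcal{P}^{\rm GM}_{\alpha}} \sum_{t=1}^{n} E_{Y_1^{t-1}}\left[h\left(Y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right) - \frac{1}{2}\log\left(2\pi e \sigma_W^2\right)\right], \tag{92}$$

subject to

$$\sum_{t=1}^{n} E\left[(X_t)^2 \,\middle|\, \underline{S}_0 = \underline{s}_0\right] \le nP. \tag{93}$$

The Lagrangian (with Lagrange multiplier $\gamma$) is

$$L\left(\mathcal{P}^{\rm GM}_{\alpha}, \gamma\right) = \sum_{t=1}^{n} E_{Y_1^{t-1}}\left[h\left(Y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right) - \frac{1}{2}\log\left(2\pi e \sigma_W^2\right)\right] - \gamma\left(\sum_{t=1}^{n} E\left[(X_t)^2 \,\middle|\, \underline{S}_0 = \underline{s}_0\right] - nP\right) \tag{94}$$
$$= \sum_{t=1}^{n} E_{Y_1^{t-1}}\left[h\left(Y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right) - \frac{1}{2}\log\left(2\pi e \sigma_W^2\right) - \gamma\, E\left[(X_t)^2 \,\middle|\, \underline{s}_0, y_1^{t-1}\right]\right] + n\gamma P \tag{95}$$
$$= E\left[\sum_{t=1}^{n} \Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right)\right] + n\gamma P. \tag{96}$$

The source $\mathcal{P}^{\rm GM}_{\alpha}$ is optimal for the horizon $n$ if and only if
it maximizes the Lagrangian $L\left(\mathcal{P}^{\rm GM}_{\alpha}, \gamma\right)$ and also satisfies
the first-order Kuhn-Tucker necessary conditions [31], which
are $\gamma \ge 0$ and

$$\gamma\left(\sum_{t=1}^{n} E\left[(X_t)^2 \,\middle|\, \underline{S}_0 = \underline{s}_0\right] - nP\right) = 0. \tag{97}$$

Further, when the maximum of the Lagrangian is achieved, since
equality (97) holds, we have

$$\max_{\mathcal{P}^{\rm GM}_{\alpha}} L\left(\mathcal{P}^{\rm GM}_{\alpha}, \gamma\right) \overset{(a)}{=} \max_{\mathcal{P}^{\rm GM}_{\alpha}} \sum_{t=1}^{n} E_{Y_1^{t-1}}\left[h\left(Y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right) - \frac{1}{2}\log\left(2\pi e \sigma_W^2\right)\right] \tag{98}$$
$$= \max_{\mathcal{P}^{\rm GM}_{\alpha}} I\left(\mathcal{M}; Y_1^n \,\middle|\, \underline{S}_0 = \underline{s}_0\right), \tag{99}$$

where (a) is obtained by substituting (97) into (94). ■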
Occasionally, the Lagrange multiplier $\gamma$ is called the
shadow price [32] for the optimization problem. As can be
seen in Definition 2, the value of $\gamma$ determines the penalty
incurred for using the signal power in the objective function (90).
It can be verified that the shadow price $\gamma$ and the power
constraint $P$ are in a one-to-one correspondence (see Corollary F
in Appendix 2-D), and that $\gamma$ monotonically decreases with
$P$ (see Proposition E in Appendix 2-D). Particularly, the
value $\gamma = 0$ corresponds to the case when there is no power
constraint, or $P \to \infty$. In this paper, we are interested in a
finite input power constraint, i.e., $P < \infty$, thus in the sequel
we only consider $\gamma > 0$.
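To make the correspondence between the shadow price and the power budget concrete, consider the memoryless special case, which is an illustrative assumption and not the general channel of this section. With no channel memory the stage reward reduces to $\frac{1}{2}\ln(1 + P/\sigma_W^2) - \gamma P$, and setting its derivative to zero gives $P = 1/(2\gamma) - \sigma_W^2$ whenever this is positive (natural logarithms assumed). The function names below are hypothetical.

```python
import math

def awgn_power_from_shadow_price(gamma, sigma2_w=1.0):
    """Power P >= 0 maximizing 0.5*ln(1 + P/sigma2_w) - gamma*P (memoryless case)."""
    # Stationary point of the concave objective; clipped at zero power.
    return max(0.0, 1.0 / (2.0 * gamma) - sigma2_w)

def awgn_rate(power, sigma2_w=1.0):
    """Information rate in nats per channel use at the given power."""
    return 0.5 * math.log(1.0 + power / sigma2_w)

# A larger shadow price buys less power, and hence a lower rate.
for gamma in (0.05, 0.1, 0.25):
    p = awgn_power_from_shadow_price(gamma)
    print(gamma, p, awgn_rate(p))
```

In this special case the map from $\gamma$ to $P$ is explicit and strictly decreasing on $0 < \gamma < 1/(2\sigma_W^2)$; for channels with memory the text notes that the relationship is monotonic but not available in explicit form.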

Corollary 4.1: For any finite power budget $P$, the corresponding
Lagrange multiplier (or shadow price) satisfies $\gamma > 0$.
The feedback-dependent Gauss-Markov source $\mathcal{P}^{*}$ that
satisfies

$$\mathcal{P}^{*} = \arg\max_{\mathcal{P}^{\rm GM}_{\alpha}} \frac{1}{n} E\left[\sum_{t=1}^{n} \Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right)\right] \tag{100}$$

achieves the $n$-block feedback capacity $C^{{\rm fb}(n)}$, where the
power budget $P$ satisfies the following equality

$$P = \frac{1}{n}\sum_{t=1}^{n} E\left[(X_t)^2 \,\middle|\, \underline{S}_0 = \underline{s}_0\right]. \tag{101}$$

Here, the condition in (101) is the optimal power configuration. □

Proof: For $\gamma > 0$, the condition (91) in Theorem 4
becomes (101). Thus by Theorem 4 (see also Appendix 2-D)
the source $\mathcal{P}^{*}$ that achieves the $n$-block feedback capacity
$C^{{\rm fb}(n)} = C^{{\rm fb}(n)}(P)$ has power equal to $P$ and
satisfies (101). ■
Corollary 4.1 asserts that we can substitute the inequality
power constraint (22) by the corresponding equality, and
for any horizon $n$ the power budget $P > 0$ corresponds
to a shadow price $\gamma$. For any shadow price $\gamma > 0$, the
source determined in Corollary 4.1 is optimal for the specific
power budget $P$ satisfying (101); see Appendix 2-D for the
proof. In other words, the power budget $P$ is a monotonic
(see Proposition E in Appendix 2-D), though not explicitly
expressible, function of $\gamma$ (see also [33], [32]). Thus, computing
the $n$-block feedback capacity $C^{{\rm fb}(n)}$ for a finite power
budget $0 < P < \infty$ is equivalent to solving the maximization
problem posed in Corollary 4.1 for some positive shadow
price $\gamma > 0$ (as proved in Appendix 2-D), although the
relationship between $P$ and $\gamma$ is not known in an explicit
form.

We note that the general relationship between the power
budget $P$ and the shadow price $\gamma$ may also be derived from [7],
but the explicitly separable form of the objective function in
Theorem 4 and Corollary 4.1 is a new result and will be crucial
for obtaining more explicit solutions to the $n$-block feedback
capacity computation problem. It should also be mentioned
that it was shown in [33] that the feedback capacity is concave
in the power budget $P$.

We next study and solve the maximization problem in
Corollary 4.1 for any given value of the shadow price $\gamma > 0$.



C. Optimal Stochastic Control Formulation 

The problem of finding the optimal source for a given value
of $\gamma$ (see Corollary 4.1) can be formulated as a standard
dynamic-system stochastic control problem [23] (Vol 1, Chapter 7
and Vol 2). We describe the dynamic system as follows.

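The state of the dynamic system at each stage (or time)
$t$ is the posterior pdf $\alpha_{t-1}(\cdot)$, which is characterized by
its mean $\underline{m}_{t-1}$ and covariance matrix $\mathbf{K}_{t-1}$. The control
(or policy) for stage $t$ is the Gauss-Markov (not necessarily
stationary) source distribution function $P_t\left(\underline{s}_t \,|\, \underline{s}_{t-1}, \alpha_{t-1}(\cdot)\right)$
characterized by the parameters $\underline{d}_t$, $e_t$ and $g_t$ (see Lemma 1),
which themselves are functions of $\underline{m}_{t-1}$ and $\mathbf{K}_{t-1}$. The
system disturbance at stage $t$ is the noisy channel output $Y_t$,
which has the following distribution

$$P_{Y_t | \alpha_{t-1}(\cdot)}\left(y_t \,\middle|\, \alpha_{t-1}(\cdot)\right) = P_{Y_t | \underline{S}_0, Y_1^{t-1}}\left(y_t \,\middle|\, \underline{s}_0, y_1^{t-1}\right) = \iint \alpha_{t-1}(\underline{v})\, P_t\left(\underline{u} \,\middle|\, \underline{v}, \alpha_{t-1}(\cdot)\right) P_{Y_t | \underline{S}_{t-1}, \underline{S}_t}\left(y_t \,\middle|\, \underline{v}, \underline{u}\right) d\underline{u}\, d\underline{v}. \tag{102}$$

For this dynamic system which has $\alpha_{t-1}(\cdot)$ as the state,
$P_t\left(\underline{s}_t \,|\, \underline{s}_{t-1}, \alpha_{t-1}(\cdot)\right)$ as the control and $Y_t$ as the disturbance,
the system equation is

$$\alpha_t(\underline{\mu}) = \frac{\int \alpha_{t-1}(\underline{v})\, P_t\left(\underline{\mu} \,\middle|\, \underline{v}, \alpha_{t-1}(\cdot)\right) P_{Y_t | \underline{S}_{t-1}, \underline{S}_t}\left(y_t \,\middle|\, \underline{v}, \underline{\mu}\right) d\underline{v}}{\iint \alpha_{t-1}(\underline{v})\, P_t\left(\underline{u} \,\middle|\, \underline{v}, \alpha_{t-1}(\cdot)\right) P_{Y_t | \underline{S}_{t-1}, \underline{S}_t}\left(y_t \,\middle|\, \underline{v}, \underline{u}\right) d\underline{u}\, d\underline{v}}, \tag{103}$$

which can be implemented by the Kalman-Bucy filter, symbolically
expressed as

$$\left(\underline{m}_t, \mathbf{K}_t\right) = F^{\rm KB}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, y_t\right). \tag{104}$$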
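If we define the matrix $\mathbf{Q}_t$ using $\mathbf{A}$ and $\underline{b}$ in (21) as

$$\mathbf{Q}_t = \mathbf{A} + \underline{b}\,\underline{d}_t^{T}, \tag{105}$$

then, we can write the component-wise Kalman-Bucy filter
(104) more explicitly as a pair of propagation equations [22]

$$\underline{m}_t = F^{\rm KB}_{\underline{m}}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, y_t\right) = \mathbf{Q}_t\underline{m}_{t-1} + g_t\underline{b} + \frac{\left(\mathbf{Q}_t\mathbf{K}_{t-1}\left(\underline{a}+\underline{c}+\underline{d}_t\right) + \underline{b}\,(e_t)^2\right)\left(y_t - \left(\underline{a}+\underline{d}_t\right)^{T}\underline{m}_{t-1} - g_t\right)}{\left(\underline{a}+\underline{c}+\underline{d}_t\right)^{T}\mathbf{K}_{t-1}\left(\underline{a}+\underline{c}+\underline{d}_t\right) + (e_t)^2 + \sigma_W^2} \tag{106}$$

and

$$\mathbf{K}_t = F^{\rm KB}_{\mathbf{K}}\left(\mathbf{K}_{t-1}, \underline{d}_t, e_t\right) = \mathbf{Q}_t\mathbf{K}_{t-1}\mathbf{Q}_t^{T} + \underline{b}\,\underline{b}^{T}(e_t)^2 - \frac{\left(\mathbf{Q}_t\mathbf{K}_{t-1}\left(\underline{a}+\underline{c}+\underline{d}_t\right) + \underline{b}\,(e_t)^2\right)\left(\mathbf{Q}_t\mathbf{K}_{t-1}\left(\underline{a}+\underline{c}+\underline{d}_t\right) + \underline{b}\,(e_t)^2\right)^{T}}{\left(\underline{a}+\underline{c}+\underline{d}_t\right)^{T}\mathbf{K}_{t-1}\left(\underline{a}+\underline{c}+\underline{d}_t\right) + (e_t)^2 + \sigma_W^2}. \tag{107}$$

Here, we note that $\mathbf{K}_t$ is a deterministic function of $\mathbf{K}_{t-1}$, $\underline{d}_t$
and $e_t$, and can be computed offline.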
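Because the covariance recursion (107) involves no channel outputs, it can be iterated before any transmission takes place. The following is a minimal NumPy sketch of one step of that recursion as read from (105) and (107); the function name `kb_covariance_step`, the argument order, and all numerical values (including the example matrices `A`, `b`, `a`, `c`) are hypothetical and stand in for the channel quantities defined earlier in the paper.

```python
import numpy as np

def kb_covariance_step(K_prev, d, e, A, b, a, c, sigma2_w=1.0):
    """One step K_{t-1} -> K_t of the covariance recursion (107)."""
    Q = A + np.outer(b, d)                      # Q_t = A + b d_t^T, as in (105)
    h = a + c + d                               # vector in the innovation variance
    gain_num = Q @ K_prev @ h + b * e ** 2      # numerator vector of the update
    innov_var = h @ K_prev @ h + e ** 2 + sigma2_w
    return (Q @ K_prev @ Q.T
            + np.outer(b, b) * e ** 2
            - np.outer(gain_num, gain_num) / innov_var)

# Hypothetical second-order example; K_0 = 0 because the initial state is known.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([1.0, 0.0])
a = np.array([0.5, 0.0])
c = np.array([0.95, 0.0])
K = np.zeros((2, 2))
history = []
for _ in range(5):
    K = kb_covariance_step(K, d=np.array([-0.3, 0.1]), e=1.0, A=A, b=b, a=a, c=c)
    history.append(K)
```

The update has the form of a Schur complement of a joint covariance, so each iterate should remain symmetric and positive semidefinite; checking this is a cheap guard against sign errors when transcribing (107).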

By Theorem 3, when the feedback-dependent Gauss-Markov
source $\mathcal{P}^{\rm GM}_{\alpha}$ is used, the process $\alpha_{t-1}(\cdot)$ has a Markov
property. For this dynamic system, if we define the reward for
each stage $t$ as $\Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right)$, then the source
optimization problem in Corollary 4.1 is an average-reward-per-stage
stochastic control problem (see the average-cost-per-stage
stochastic control problem in [23], Chapter 7 and
Volume 2).



D. Source Optimization and Feedback Capacity Computation 

We now describe a fairly simple dynamic-programming
value-iteration algorithm [23] which finds, for any horizon $n$,
the optimal source distribution that maximizes the information
$I\left(\mathcal{M}; Y_1^n \,|\, \underline{S}_0 = \underline{s}_0\right)$, and thus computes the $n$-block
feedback capacity $C^{{\rm fb}(n)}$ for any shadow price $\gamma$ (or equivalently
for the corresponding power budget $P$).

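Algorithm 1 For Optimizing the
Feedback-Capacity-Achieving Source Distribution
for Any Given Shadow Price $\gamma > 0$

Initialization: For any possible value of the
pair $\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}\right)$, define the terminal reward
function as $J^{(0)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = 0$.
Recursions: For $k = 1, 2, \ldots, n$, generate the optimal $k$-stage
reward-to-go functions as

$$J^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = \max_{\{\underline{d}_t, e_t, g_t\}} \left\{ \Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right) + E\left[J^{(k-1)}\left(\underline{m}_t, \mathbf{K}_t, \gamma\right)\right] \right\}. \tag{108}$$

At the same time, we obtain the optimal $k$th-stage
policy $P^{(k)}\left(\underline{s}_t \,|\, \underline{s}_{t-1}, \alpha_{t-1}(\cdot)\right)$ as defined by the
following source coefficients

$$\left\{ \underline{d}^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right),\ e^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right),\ g^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) \right\} = \arg\max_{\{\underline{d}_t, e_t, g_t\}} \left\{ \Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right) + E\left[J^{(k-1)}\left(\underline{m}_t, \mathbf{K}_t, \gamma\right)\right] \right\}. \tag{109}$$

Here, in (108) and (109), the terms $\underline{m}_t$ and $\mathbf{K}_t$
are computed by the Kalman-Bucy equations (106)
and (107), respectively.
Optimized source: The optimal Gauss-Markov source
distribution is

$$\mathcal{P}^{\rm GM}_{\alpha} = \left\{ P_t\left(\underline{s}_t \,\middle|\, \underline{s}_{t-1}, \alpha_{t-1}(\cdot)\right) = P^{(n-t+1)}\left(\underline{s}_t \,\middle|\, \underline{s}_{t-1}, \alpha_{t-1}(\cdot)\right),\ t = 1, 2, \ldots, n \right\}, \tag{110}$$

where $P^{(n-t+1)}\left(\underline{s}_t \,|\, \underline{s}_{t-1}, \alpha_{t-1}(\cdot)\right)$ is the optimal
$(n-t+1)$th-stage policy obtained by running
the above iterations.

End.

The above value iteration algorithm determines the
source distribution that maximizes the information
$I\left(\mathcal{M}; Y_1^n \,|\, \underline{S}_0 = \underline{s}_0\right)$ for any given shadow price $\gamma > 0$ (or
the corresponding finite power budget $P$), see [23] (Ch. 7 and
Volume II). Each stage of the value iteration determines one
set of Gauss-Markov source coefficients used for one transmission
as in (110). Notice that the source distribution $\mathcal{P}^{\rm GM}_{\alpha}$
in Corollary 4.1 (or equivalently the coefficients $\underline{d}_t$ and $e_t$
in Lemma 1) has a dependence on the conditional mean $\underline{m}_{t-1}$.
We next show that this dependence on $\underline{m}_{t-1}$ can be dropped,
and the optimal signal can be characterized more explicitly.

Theorem 5: There exists a feedback-capacity-achieving
feedback-dependent Gauss-Markov source distribution $\mathcal{P}^{\rm GM}_{\alpha}$
of the kind as given in Lemma 1 whose coefficients have the
following form

$$\underline{d}_t = \underline{d}_t\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}\right) = \underline{d}_t\left(\mathbf{K}_{t-1}\right), \tag{111}$$
$$e_t = e_t\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}\right) = e_t\left(\mathbf{K}_{t-1}\right), \tag{112}$$
$$g_t = g_t\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}\right) = -\left(\underline{d}_t\left(\mathbf{K}_{t-1}\right)\right)^{T}\underline{m}_{t-1}. \tag{113}$$

That is, an input signal of the form

$$X_t = \underline{d}_t^{T}\underline{S}_{t-1} + e_t Z_t + g_t = \underline{d}_t^{T}\left(\underline{S}_{t-1} - \underline{m}_{t-1}\right) + e_t Z_t, \tag{114}$$

achieves the $n$-block feedback channel capacity $C^{{\rm fb}(n)}$. Further,
the processes $\mathbf{K}_t$, $\underline{d}_t$ and $e_t$ are all deterministic and can
be determined off-line before the transmission starts. □

Proof: It suffices to prove that, for any $k > 0$, the
$k$th-stage optimal policy obtained from the value iteration
algorithm (108) and (109) has the following structure

$$\underline{d}^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = \underline{d}^{(k)}\left(\mathbf{K}_{t-1}, \gamma\right), \tag{115}$$
$$e^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = e^{(k)}\left(\mathbf{K}_{t-1}, \gamma\right), \tag{116}$$
$$g^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = -\left(\underline{d}^{(k)}\left(\mathbf{K}_{t-1}, \gamma\right)\right)^{T}\underline{m}_{t-1}. \tag{117}$$

We show by induction on $k$ that the $k$-stage reward-to-go
function is independent of the realization of the mean $\underline{m}_{t-1}$,
i.e.,

$$J^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = J^{(k)}\left(\mathbf{K}_{t-1}, \gamma\right), \tag{118}$$

and (115), (116) and (117) will be the byproducts.
Equality (118) is trivially true for $k = 0$
since $J^{(0)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = 0$ by definition. Now, let
us assume that (118) is true for $k - 1$, where $k \ge 1$, i.e.,

$$J^{(k-1)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = J^{(k-1)}\left(\mathbf{K}_{t-1}, \gamma\right). \tag{119}$$
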

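By utilizing the inductive assumption (119), we can rewrite
the value iteration (108) as

$$J^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = \max_{\{\underline{d}_t, e_t, g_t\}} \left\{ \Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right) + E\left[J^{(k-1)}\left(\mathbf{K}_t, \gamma\right)\right] \right\} \tag{120}$$
$$= \max_{\{\underline{d}_t, e_t, g_t\}} \left\{ \Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right) + J^{(k-1)}\left(\mathbf{K}_t, \gamma\right) \right\}. \tag{121}$$

Here, we drop the expectation operator in (121) because the
conditional covariance matrix $\mathbf{K}_t = F^{\rm KB}_{\mathbf{K}}\left(\mathbf{K}_{t-1}, \underline{d}_t, e_t\right)$ is a
(deterministic) function of $\mathbf{K}_{t-1}$, $\underline{d}_t$ and $e_t$. We also note that
in order to achieve the maximization in (121), we need to have

$$g_t = -\underline{d}_t^{T}\underline{m}_{t-1}. \tag{122}$$

This is obvious by noting that $J^{(k-1)}\left(\mathbf{K}_t, \gamma\right) = J^{(k-1)}\left(F^{\rm KB}_{\mathbf{K}}\left(\mathbf{K}_{t-1}, \underline{d}_t, e_t\right), \gamma\right)$
is independent of the
source coefficient $g_t$, and that, for any choice of coefficients
$\underline{d}_t$ and $e_t$, the reward function $\Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, g_t, \gamma\right)$
defined in Definition 2 is maximized by $g_t = -\underline{d}_t^{T}\underline{m}_{t-1}$.
Now, we only need to show that for any two different
realizations $\underline{m}_{t-1}$ and $\tilde{\underline{m}}_{t-1}$ of the conditional mean of
the channel state, where $\underline{m}_{t-1} \ne \tilde{\underline{m}}_{t-1}$, if $\left(\underline{d}_t, e_t, g_t\right) = \left(\underline{d}_t, e_t, -\underline{d}_t^{T}\underline{m}_{t-1}\right)$
is optimal for the pair $\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}\right)$,
then $\left(\underline{d}_t, e_t, \tilde{g}_t\right) = \left(\underline{d}_t, e_t, -\underline{d}_t^{T}\tilde{\underline{m}}_{t-1}\right)$ must be optimal
for $\left(\tilde{\underline{m}}_{t-1}, \mathbf{K}_{t-1}\right)$, and that

$$J^{(k)}\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \gamma\right) = J^{(k)}\left(\tilde{\underline{m}}_{t-1}, \mathbf{K}_{t-1}, \gamma\right). \tag{123}$$

These are verified by checking, for any choice of $\underline{d}_t$ and $e_t$,
the following equality

$$\Omega\left(\underline{m}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, -\underline{d}_t^{T}\underline{m}_{t-1}, \gamma\right) + J^{(k-1)}\left(F^{\rm KB}_{\mathbf{K}}\left(\mathbf{K}_{t-1}, \underline{d}_t, e_t\right), \gamma\right) = \Omega\left(\tilde{\underline{m}}_{t-1}, \mathbf{K}_{t-1}, \underline{d}_t, e_t, -\underline{d}_t^{T}\tilde{\underline{m}}_{t-1}, \gamma\right) + J^{(k-1)}\left(F^{\rm KB}_{\mathbf{K}}\left(\mathbf{K}_{t-1}, \underline{d}_t, e_t\right), \gamma\right), \tag{124}$$

which is obvious from Definition 2 for $\Omega(\cdot)$ and the inductive
assumption (119). We have thus verified that (118) is true
for every $k \ge 0$, and as a direct consequence, (115) and (116)
follow. The equality (117) is validated by substituting (115)
into equation (122). Finally, (111)-(113) follow from (115)-(117)
by the optimality of the value iteration algorithm.

The covariance matrix sequence $\mathbf{K}_t$ can be computed
recursively by Kalman-Bucy filtering, i.e., $\mathbf{K}_t = F^{\rm KB}_{\mathbf{K}}\left(\mathbf{K}_{t-1}, \underline{d}_t\left(\mathbf{K}_{t-1}\right), e_t\left(\mathbf{K}_{t-1}\right)\right)$,
which does not depend
on the realization of the random channel output $y_t$. Thus, for
any $\mathbf{K}_0$, the sequence $\mathbf{K}_t$ is deterministic, and so are $\underline{d}_t = \underline{d}_t\left(\mathbf{K}_{t-1}\right)$
and $e_t = e_t\left(\mathbf{K}_{t-1}\right)$. ■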

As shown in Theorem 5, in order to achieve the $n$-block
feedback capacity $C^{{\rm fb}(n)}$, we only need to consider a very
simple channel input signal of the following form

$$X_t = \underline{d}_t^{T}\underline{S}_{t-1} + e_t Z_t + g_t = \underbrace{\underline{d}_t^{T}\left(\underline{S}_{t-1} - \underline{m}_{t-1}\right)}_{\text{Kalman innovation}} + e_t Z_t, \tag{125}$$

and choose the parameters $\underline{d}_t$ and $e_t$ properly. Such a
feedback-capacity-achieving signal characterization in Theorem 5
also asserts that the center-of-gravity encoding rule
(formulated in [16] for memoryless channels) is also optimal
for channels with memory. The expected value of the input
signal at the receiver's site should always be zero, i.e.,

$$E\left[X_t \,\middle|\, y_1^{t-1}, \underline{s}_0\right] = 0. \tag{126}$$

An intuitive explanation is that the mean of the signal,
i.e., $E\left[X_t \,|\, y_1^{t-1}, \underline{s}_0\right]$, is always known to both the transmitter
and the receiver (i.e., it is deterministic) and thus it does not
carry any useful information but only wastes energy. Clearly
this waste (the mean) should be equated to zero. The amount
of information that is carried by each symbol $X_t$ is determined
by the conditional variance of the channel input.

The optimal linear signaling in Theorem 5 also conforms
with the linear characterization shown by Cover and Pombra [7],
see (14). We note that our new Kalman-Bucy filtering
structure admits an encoder whose complexity does not grow
with time, and the above filtering structure gives rise to
a computation algorithm that breaks the $n$-block feedback
capacity computation problem into $n$ sequential stages, where
in each stage only $O(L^2)$ variables need to be optimized (since
the dimension of the matrix $\mathbf{K}_t$ is $L \times L$).

It is interesting to note that the linear signaling form in (125)
has already been used as a code by Butman for autoregressive 
(AR) Gaussian noise channels, see ©-(Qj}, and equations (1) 
and (28) in [11], where Butman assumed that $e_t = 0$ for $t > 1$,
but provided no proof that such a code is optimal. Butman
also showed that the Kalman-Bucy filter needs to be utilized
with his chosen code, see equation (29) in [11]. While we still
cannot confirm that Butman's choice of parameters $e_t = 0$ for
$t > 1$ is optimal, we can now confirm that Butman's code,
at least parametrically, matches the optimal solution for AR 
Gaussian noise channels. 

By Theorem 5, the source coefficients $\underline{d}_t$ and $e_t$ and
the conditional covariance matrix $\mathbf{K}_t$ are all deterministic
and can be computed off-line. Thus, the stochastic dynamic
programming Algorithm 1 can be simplified, and we obtain the
following deterministic dynamic programming Algorithm 2,
which can be executed off-line (without actually transmitting
any signals through the channel). Note that Algorithm 1 on
the other hand is not deterministic because the expectations
in (108) and (109) need to be computed (typically by Monte
Carlo methods).

Algorithm 2 For Optimizing the
Feedback-Capacity-Achieving Source Distribution
for Any Given Shadow Price $\gamma > 0$

Initialization: For any possible value of the covariance
matrix $K_{t-1}$, define the terminal reward-to-go function
as $J^{(0)}(K_{t-1}, \gamma) = 0$.

Recursions: For $k = 1, 2, \ldots, n$, generate the optimal
$k$-stage reward-to-go functions as

$$J^{(k)}(K_{t-1}, \gamma) \triangleq \max_{d_t, e_t} \left\{ \varpi\left(m_{t-1}, K_{t-1}, d_t, e_t, -d_t^T m_{t-1}, \gamma\right) + J^{(k-1)}(K_t, \gamma) \right\}, \quad (127)$$

and obtain the optimal $k$th-stage policy
$P^{(k)}(s_t \mid s_{t-1}, \alpha_{t-1}(\cdot))$ as defined by

$$\left( d_t(K_{t-1}, \gamma),\; e_t(K_{t-1}, \gamma) \right) = \arg\max_{d_t, e_t} \left\{ \varpi\left(m_{t-1}, K_{t-1}, d_t, e_t, -d_t^T m_{t-1}, \gamma\right) + J^{(k-1)}(K_t, \gamma) \right\}. \quad (128)$$

Here, in (127) and (128), the terms $m_t$ and $K_t$
are computed by the Kalman-Bucy equations (106)
and (107), respectively.

Optimized source: The optimal Gauss-Markov source
distribution is

$$\mathcal{P}^{GM} = \left\{ P_t(s_t \mid s_{t-1}, \alpha_{t-1}(\cdot)) = P^{(n-t+1)}(s_t \mid s_{t-1}, \alpha_{t-1}(\cdot)),\; t = 1, 2, \ldots, n \right\}, \quad (129)$$

where $P^{(n-t+1)}(s_t \mid s_{t-1}, \alpha_{t-1}(\cdot))$ is the
optimal $(n-t+1)$th-stage policy.

End.
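As an illustration only, the backward recursion of Algorithm 2 can be sketched for a first-order channel ($L = 1$) with a quantized search space. Several things here are assumptions rather than the paper's definitions: the per-stage reward is taken to be the rate term of (131) minus $\gamma$ times the power term of (130), the covariance update is the scalar form of the Kalman-Bucy recursion for the first-order channel of Section VI-B, natural logarithms are used, and the function names and grids are hypothetical.

```python
import math

def riccati_step(K, d, e2, a, c, sigma2):
    """Scalar (L = 1) covariance update K_t = F(K_{t-1}, d_t, e_t); e2 is e_t squared."""
    num = (a + d) ** 2 * K * sigma2 + c ** 2 * e2 * K + e2 * sigma2
    den = (a + c + d) ** 2 * K + e2 + sigma2
    return num / den

def stage_reward(K, d, e2, a, c, sigma2, gamma):
    """ASSUMED reward: rate term of (131) minus gamma times the power term of (130)."""
    rate = 0.5 * math.log((sigma2 + (a + c + d) ** 2 * K + e2) / sigma2)
    return rate - gamma * (d ** 2 * K + e2)

def value_iteration(n, a, c, sigma2, gamma, K_grid, d_grid, e2_grid):
    """Run n backward recursions of the form (127) on a quantized grid of K values."""
    def nearest(K):
        # Snap a covariance value to the closest grid point (quantization).
        return min(range(len(K_grid)), key=lambda i: abs(K_grid[i] - K))

    J = [0.0] * len(K_grid)   # J^(0) = 0 for every K
    policies = []             # policies[k-1][i]: best (d, e2) at stage k, grid point i
    for _ in range(n):
        J_new, policy = [], []
        for K in K_grid:
            best_val, best_arg = -math.inf, None
            for d in d_grid:
                for e2 in e2_grid:
                    K_next = riccati_step(K, d, e2, a, c, sigma2)
                    val = (stage_reward(K, d, e2, a, c, sigma2, gamma)
                           + J[nearest(K_next)])
                    if val > best_val:
                        best_val, best_arg = val, (d, e2)
            J_new.append(best_val)
            policy.append(best_arg)
        J = J_new
        policies.append(policy)
    return J, policies
```

A simple sanity check is the memoryless case $a = c = 0$, where the assumed reward depends only on the transmitted power $q = d^2 K + e^2$ and is maximized at $q = 1/(2\gamma) - 1$, so each stage should contribute $\frac{1}{2}\log(1+q) - \gamma q$ regardless of $K$.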

After the source coefficients d t and et are optimized by 
running Algorithm 2, by combining TheoremHl Lemma|2]and 
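Theorem 5, the power budget $P$ is determined as

$$P = \frac{1}{n} \sum_{t=1}^{n} \left( d_t^T K_{t-1} d_t + (e_t)^2 \right), \quad (130)$$

and the n-block feedback channel capacity $C^{\mathrm{fb}(n)}$ can be
determined as

$$C^{\mathrm{fb}(n)} = \frac{1}{n} \sum_{t=1}^{n} \left[ h\left(Y_t \mid Y_1^{t-1}, s_0\right) - \frac{1}{2} \log\left(2 \pi e \sigma_W^2\right) \right] = \frac{1}{n} \sum_{t=1}^{n} \frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d_t)^T K_{t-1} (a + c + d_t) + (e_t)^2}{\sigma_W^2} \right), \quad (131)$$

both of which can be computed off-line (without actually
transmitting any symbols). We note that the n-block feedback
capacity (131) is independent of the initial channel state $s_0$.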
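The off-line evaluation of (130) and (131) can be sketched as follows for a first-order channel ($L = 1$). This is a minimal sketch under stated assumptions: the covariance is propagated with the scalar form of the Kalman-Bucy update, the initial covariance is taken as $K_0 = 0$ because the initial state is known, natural logarithms are used, and the function name is hypothetical.

```python
import math

def block_power_and_rate(a, c, sigma2, d_seq, e_seq, K0=0.0):
    """Return (P, C) from (130) and (131) for given coefficient sequences, L = 1.

    d_seq[t-1], e_seq[t-1] hold d_t and e_t; K0 = 0 is an ASSUMPTION (known s_0).
    """
    n = len(d_seq)
    K = K0
    power_sum = 0.0
    rate_sum = 0.0
    for d, e in zip(d_seq, e_seq):
        e2 = e * e
        # Term of (130): input power spent at time t.
        power_sum += d * d * K + e2
        # Term of (131): information contributed at time t (in nats).
        rate_sum += 0.5 * math.log((sigma2 + (a + c + d) ** 2 * K + e2) / sigma2)
        # Scalar covariance update K_{t-1} -> K_t.
        K = (((a + d) ** 2 * K * sigma2 + c ** 2 * e2 * K + e2 * sigma2)
             / ((a + c + d) ** 2 * K + e2 + sigma2))
    return power_sum / n, rate_sum / n
```

For the memoryless case $a = c = 0$ with $d_t = 0$ and $e_t = 1$, the function returns $P = 1$ and the familiar AWGN value $\frac{1}{2}\log 2$, which is a quick way to check an implementation before trying channels with memory.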
Fig. 8. The feedback capacity for a first-order linear Gaussian noise channel
($a = 0.5$, $c = 0.95$)



of the parameters. One good reference point is the optimal
stationary source that achieves $\mathcal{I}_{\max}$. In the next section
(Section VI), we develop a theory for explicitly determining
the optimal stationary feedback source (or feedback code) and
the corresponding maximal information rate $\mathcal{I}_{\max}$.

VI. The Maximal Feedback Information Rate of 
Stationary Sources 

So far, we have derived a simple linear signaling and
feedback strategy that can achieve the n-block feedback
capacity for any finite horizon depth n. We have also given a
simple dynamic programming Algorithm 2 to optimize the
signal and thus compute the n-block feedback capacity. We
now concentrate our attention on (asymptotically) stationary
sources and their feedback information rates corresponding
to $n \to \infty$.



E. Complexity Analysis of Algorithm 2 and Feedback Capacity 
Curves 

Directly computing the n-block feedback capacity using the
formula in [7], see (13), involves $O(n^2)$ unknown variables.
Here, we briefly analyze the complexity of the dynamic
programming Algorithm 2. In Algorithm 2, the unknown
variables are the length-$L$ vector $d_t$, the scalar $e_t$ and the $L \times L$
covariance matrix $K_t$ for $t = 1, 2, \ldots, n$. The total number
of free parameters is thus $n\kappa$, where $\kappa = L + 1$ and $L$
is the channel memory order. Each iteration in Algorithm 2
in general requires an expensive exhaustive search (which
could be implemented by appropriately quantizing the search
space). For reasonable channel memory orders $L$, reasonable
block lengths $n$ and reasonable numerical accuracies (as
determined by the quantization step), Algorithm 2 can be
easily carried out by ordinary computers. We have plotted
the results of such a computation in Figure 8, where the
computation takes several minutes. The figure shows that when
the block length $n$ increases, the n-block feedback capacity
very quickly saturates, such that the capacity curves $C^{\mathrm{fb}(n)}$
for $n > 15$ reach saturation. The figure also shows the
maximal feedback information rate $\mathcal{I}_{\max}$ of stationary sources,
which we derive in Section VI. We also note that for large $n$
the curves $C^{\mathrm{fb}(n)}$ seem to be numerically indistinguishable
from $\mathcal{I}_{\max}$, suggesting that for large $n$ the n-block feedback
capacities converge to the maximum information rate
$\mathcal{I}_{\max}$ achieved by stationary sources.

It is interesting to contrast the convex programming
approach of [19] to our approach (Algorithm 2) in terms of
complexity and accuracy. In the convex optimization approach
of [19], the number of variables is proportional to $n^2$, but the
method gives provable complexity vs. accuracy bounds. In our
Algorithm 2, the number of variables is linearly proportional to
$n$, but the numerical accuracy and effectiveness of Algorithm 2
depend on how we quantize the search space, i.e., how we
quantize the possible values of parameters $K_{t-1}$, $m_{t-1}$, $d_t$
and $e_t$. We can efficiently run Algorithm 2 by choosing a
good reference point for determining the quantization range



A. Maximal Information Rates Achieved by Stationary 
Feedback-Dependent Sources 

As shown in Sections IV and V, any achievable information
rate can be reached by a feedback-dependent Gauss-Markov
(not necessarily stationary) source. Thus, we only need to
consider feedback-dependent Gauss-Markov sources in the
form shown in Theorem 5. We first define the stationary
feedback-dependent Gauss-Markov sources that correspond to
the steady state of the Kalman-Bucy feedback filter.

Definition 3: [Stationary feedback-dependent Gauss-Markov
sources] A stationary feedback-dependent Gauss-Markov
source is a source that induces stationary channel
input and output processes.

An asymptotically stationary feedback-dependent Gauss-Markov
source, in its limit as $t \to \infty$, induces stationary
channel input and output processes. □

Lemma 3: For a stationary (or an asymptotically stationary)
feedback-dependent Gauss-Markov source, the Kalman-Bucy
covariance matrix $K_t$ and source coefficients $d_t$ and $e_t$
converge, i.e.,

$$\lim_{t \to \infty} K_t = K, \quad (132)$$

$$\lim_{t \to \infty} d_t = d, \quad (133)$$

$$\lim_{t \to \infty} e_t = e. \quad (134)$$

Here, the matrix $K$ satisfies the stationary Kalman-Bucy filter
equation (the algebraic Riccati equation)

$$K = F^{K}(K, d, e) = Q K Q^T + b b^T e^2 - \frac{\left(Q K (a + c + d) + b e^2\right)\left(Q K (a + c + d) + b e^2\right)^T}{(a + c + d)^T K (a + c + d) + e^2 + \sigma_W^2}, \quad (135)$$

where $Q = A + b\, d^T$. □
Proof: Since the stationary (or asymptotically stationary)
source induces, in its limit as $t \to \infty$, stationary channel input
and output processes, the Kalman-Bucy filter has a steady
state, and thus the sequences $K_t$, $d_t$ and $e_t$ converge.
Equation (135) is then obtained as the stationary form of the
covariance matrix propagation equation (107) of the Kalman-Bucy
filter. ■
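The steady state in Lemma 3 can be explored numerically by iterating the covariance recursion until it stops changing. The sketch below does this for a first-order channel ($L = 1$), where (135) reduces to a scalar map; the function name, starting value and tolerance are hypothetical choices, and convergence of the plain iteration is assumed rather than guaranteed for every parameter choice.

```python
def steady_state_variance(a, c, sigma2, d, e, K_init=1.0, tol=1e-12, max_iter=10000):
    """Iterate the scalar form of K <- F(K, d, e) until a fixed point is reached."""
    e2 = e * e
    K = K_init
    for _ in range(max_iter):
        K_next = (((a + d) ** 2 * K * sigma2 + c ** 2 * e2 * K + e2 * sigma2)
                  / ((a + c + d) ** 2 * K + e2 + sigma2))
        if abs(K_next - K) < tol:
            return K_next
        K = K_next
    raise RuntimeError("covariance recursion did not settle within max_iter steps")
```

With $e = 0$ and $(a + d)^2 > 1$, the scalar fixed-point equation has the non-zero solution $K = \sigma_W^2 \left((a + d)^2 - 1\right) / (a + c + d)^2$, which gives a closed-form value to compare the iteration against. Starting from $K = 0$ with $e = 0$ would stay at zero, which is why the sketch starts from a positive value.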

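Theorem 6: [Maximal Information Rates for Stationary
Feedback-dependent Sources] For a power constrained linear
Gaussian noise channel depicted in Figure 3, whose noise has
a rational power spectral density

$$S_N(\omega) = \sigma_W^2 \, \frac{\left(1 - \sum_{i=1}^{L} a_i e^{-ji\omega}\right)\left(1 - \sum_{i=1}^{L} a_i e^{ji\omega}\right)}{\left(1 + \sum_{i=1}^{L} c_i e^{-ji\omega}\right)\left(1 + \sum_{i=1}^{L} c_i e^{ji\omega}\right)}, \quad (136)$$

the maximal information rate achieved by (asymptotically)
stationary feedback-dependent sources subject to the average
input power constraint

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E\left[(X_t)^2 \mid S_0 = s_0\right] \leq P, \quad (137)$$

equals

$$\mathcal{I}_{\max} = \max_{d, e} \frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d)^T K (a + c + d) + e^2}{\sigma_W^2} \right), \quad (138)$$

where the maximization in (138) is taken under the following
two constraints

$$d^T K d + e^2 = P, \quad (139)$$

$$K = Q K Q^T + b b^T e^2 - \frac{\left(Q K (a + c + d) + b e^2\right)\left(Q K (a + c + d) + b e^2\right)^T}{(a + c + d)^T K (a + c + d) + e^2 + \sigma_W^2}. \quad (140)$$

Here, $a = [a_1, a_2, \ldots, a_L]^T$, $c = [c_1, c_2, \ldots, c_L]^T$, and
matrix $A$ and vector $b$ are

$$A \triangleq \begin{bmatrix} a_1 & a_2 & \cdots & a_{L-1} & a_L \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & \ddots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \qquad b \triangleq \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}^T.$$

The matrix $Q$ is defined as $Q = A + b\, d^T$, and the matrix $K$
is constrained to be non-negative definite. □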
Proof: By Lemma 3, for any (asymptotically) stationary
Gauss-Markov source, the sequences $K_t$, $d_t$ and $e_t$ converge
as $t \to \infty$, so we have

$$\lim_{t \to \infty} \frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d_t)^T K_{t-1} (a + c + d_t) + (e_t)^2}{\sigma_W^2} \right) = \frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d)^T K (a + c + d) + e^2}{\sigma_W^2} \right). \quad (141)$$

Combining Lemma 2 with Theorem 5 and Lemma 3, we also
get

$$\lim_{t \to \infty} E\left[(X_t)^2 \mid S_0 = s_0, Y_1^{t-1} = y_1^{t-1}\right] = \lim_{t \to \infty} \left( d_t^T K_{t-1} d_t + (e_t)^2 \right). \quad (142)$$

The information rate and the average channel input power
can now be easily computed as the Cesaro means of the
converging sequences in (141) and (142). From (141) and
using equation (fJTJ and Lemma [21 the information rate for 
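the (asymptotically) stationary source exists and equals

$$\lim_{n \to \infty} \frac{1}{n} I\left(M; Y_1^n \mid S_0 = s_0\right) = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d_t)^T K_{t-1} (a + c + d_t) + (e_t)^2}{\sigma_W^2} \right) \quad (143)$$

$$= \frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d)^T K (a + c + d) + e^2}{\sigma_W^2} \right). \quad (144)$$

From (142), the average signal power equals

$$\lim_{n \to \infty} \frac{1}{n} E\left[ \sum_{t=1}^{n} (X_t)^2 \,\Big|\, S_0 = s_0 \right] = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E\left[ E\left[ (X_t)^2 \mid S_0 = s_0, Y_1^{t-1} \right] \,\Big|\, S_0 = s_0 \right] \quad (145)$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \left( d_t^T K_{t-1} d_t + (e_t)^2 \right) = d^T K d + e^2. \quad (146)$$

Using the results in (144) and (146), we conclude that the
maximal information rate $\mathcal{I}_{\max}$ for (asymptotically) stationary
Gauss-Markov sources exists and equals

$$\mathcal{I}_{\max} = \max_{d, e} \frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d)^T K (a + c + d) + e^2}{\sigma_W^2} \right), \quad (147)$$

where the maximization in (147) is taken under the power
constraint

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E\left[(X_t)^2 \mid S_0 = s_0\right] = d^T K d + e^2 \leq P, \quad (148)$$

and also under the constraint that the covariance matrix $K$
satisfies the stationary Kalman-Bucy equation

$$K = F^{K}(K, d, e), \quad (149)$$

which is explicitly expressed in (140).

By Corollary 4.1, the inequality in (148) can be substituted
by an equality, thus proving the validity of the power
constraint (139). ■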

So far, our attempts to further manipulate the expressions
in Theorem 6 into a simpler analytical form have not
produced the desired outcome because of the complexity of
the Kalman-Bucy (algebraic Riccati) equation (140). However,
we can readily find the solution to the non-linear programming
problem in Theorem 6 numerically, an example of which is
depicted in Figure 9. The asymptotic information rate given
by Theorem 6, depicted in Figure 9, is compared to the
feed-forward capacity computed by the water-filling method [5],
[6], [34].

We note that $\mathcal{I}_{\max}$ is a lower bound on the feedback capacity
$C^{\mathrm{fb}}$. Figure 8 shows that $\mathcal{I}_{\max}$ numerically overlaps the
n-block feedback capacity for long block lengths.

Fig. 9. Maximal information rates achieved by stationary sources over a
third order Gaussian noise channel ($a = [0, 0.6, 0.4]^T$, $c = [0.5, 0.4, 0]^T$)



Under some special circumstances, for example first-order
Gaussian channels, we can solve the optimization problem
stated in Theorem 6 explicitly. We consider this scenario next.

B. Maximal Information Rates of Stationary Sources Used
over First-Order Channels

We consider the first-order linear-time-invariant (LTI) Gaussian
noise channel, where $L = 1$. The channel coefficients $a$
and $c$ (where $-1 < a < 1$ and $-1 < c < 1$) and the channel
state covariance $K$ are all scalars. We restrict ourselves
to the case $a + c \neq 0$, since otherwise the channel is simply
the well studied AWGN channel.

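Theorem 7: For a first-order Gaussian noise channel
characterized by coefficients $a$, $c$ or equivalently by its noise power
spectral density function

$$S_N(\omega) = \sigma_W^2 \, \frac{\left(1 - a e^{-j\omega}\right)\left(1 - a e^{j\omega}\right)}{\left(1 + c e^{-j\omega}\right)\left(1 + c e^{j\omega}\right)}, \quad (150)$$

the maximal information rate $\mathcal{I}_{\max}$ achieved by stationary
feedback-dependent sources subject to the average input power
constraint

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E\left[(X_t)^2 \mid S_0 = s_0\right] \leq P \quad (151)$$

equals

$$\mathcal{I}_{\max} = \frac{1}{2} \log\left( 1 + \frac{(1 + \eta)^2 P}{\sigma_W^2} \right). \quad (152)$$

Here, the parameter $\eta$ is the largest positive root of the
following 4-th order equation

$$\frac{P}{\sigma_W^2} \eta^4 + 2 \frac{P}{\sigma_W^2} \eta^3 + \left( \frac{P}{\sigma_W^2} + 1 - a^2 \right) \eta^2 - 2 a (a + c) \eta - (a + c)^2 = 0. \quad (153)$$

The feedback-dependent Gauss-Markov (not necessarily
stationary) source that solves (152) has coefficients $d =
(a + c)/\eta$ and $e = 0$ in its steady state. □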
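Evaluating Theorem 7 numerically only requires locating the largest positive root of the quartic (153). The following is a minimal sketch, assuming a simple downward scan followed by bisection is accurate enough; the function names and the scan resolution are hypothetical, the rate is returned in nats, and the scan could in principle miss two roots that lie closer together than one scan step.

```python
import math

def eta_polynomial(eta, a, c, snr):
    """Left-hand side of the quartic (153); snr stands for P / sigma_W^2."""
    return (snr * eta ** 4 + 2 * snr * eta ** 3 + (snr + 1 - a * a) * eta ** 2
            - 2 * a * (a + c) * eta - (a + c) ** 2)

def largest_positive_root(a, c, snr, scan_points=20000):
    """Scan down from a point where the quartic is positive, then bisect."""
    # For |a| < 1 the quartic is positive for every eta at or above this bound.
    hi = abs(a + c) / (1.0 - abs(a)) + 1.0
    step = hi / scan_points
    lo = 0.0
    for k in range(scan_points - 1, -1, -1):
        lo = k * step
        if eta_polynomial(lo, a, c, snr) <= 0.0:
            break          # sign change bracketed in [lo, hi]
        hi = lo
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if eta_polynomial(mid, a, c, snr) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def theorem7_rate(a, c, sigma2, P):
    """Return (rate, eta, d): the rate (152), the root of (153), and d = (a + c) / eta."""
    snr = P / sigma2
    eta = largest_positive_root(a, c, snr)
    rate = 0.5 * math.log(1.0 + (1.0 + eta) ** 2 * snr)
    return rate, eta, (a + c) / eta
```

For example, `theorem7_rate(0.0, 0.95, 1.0, 1.0)` evaluates the first-order AR channel with $c = 0.95$ at unit power and unit noise variance. The scan terminates because the quartic equals $-(a + c)^2 < 0$ at $\eta = 0$ whenever $a + c \neq 0$.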



Proof: By Theorem 6, the maximal information rate $\mathcal{I}_{\max}$
is determined by solving the following optimization problem

$$\mathcal{I}_{\max} = \max_{d, e} \frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d)^2 K + e^2}{\sigma_W^2} \right) \quad (154)$$

under the following constraints

$$d^2 K + e^2 = P, \quad (155)$$

$$K = \frac{(a + d)^2 K \sigma_W^2 + c^2 e^2 K + e^2 \sigma_W^2}{(a + c + d)^2 K + e^2 + \sigma_W^2}. \quad (156)$$

By substituting the constraint (155) into the objective
function (154) and noting that the function $\log(\cdot)$ is strictly
monotonic, the optimization problem (154), (155) and (156)
is equivalent to the following optimization problem

$$\max_{d, e} \; (a + c + d)^2 K - d^2 K \quad (157)$$

with constraints

$$d^2 K + e^2 = P, \quad (158)$$

$$K^2 (a + c + d)^2 + K \left( e^2 + \sigma_W^2 - \sigma_W^2 (a + d)^2 - c^2 e^2 \right) - e^2 \sigma_W^2 = 0. \quad (159)$$

Obviously, the optimal stationary channel state variance $K$
needs to satisfy $K > 0$.

The Lagrangian function for the optimization
problem (157), (158) and (159) is

$$\mathcal{L}(d, e, K, \lambda, \rho) = (a + c + d)^2 K - d^2 K + \lambda \left( d^2 K + e^2 - P \right) + \rho \left( K^2 (a + c + d)^2 + K \left( e^2 + \sigma_W^2 - \sigma_W^2 (a + d)^2 - c^2 e^2 \right) - e^2 \sigma_W^2 \right), \quad (160)$$

where $\lambda$ and $\rho$ are the Lagrange multipliers. Let the
first-order derivatives of the Lagrangian function $\mathcal{L}(d, e, K, \lambda, \rho)$
be zeros, and we have the following necessary conditions for
optimality

$$2K \left( a + c + \lambda d + \rho \left( K (a + c + d) - \sigma_W^2 (a + d) \right) \right) = 0, \quad (161)$$

$$2e \left[ \lambda + \rho \left( K (1 - c^2) - \sigma_W^2 \right) \right] = 0, \quad (162)$$

$$(a + c + d)^2 - d^2 + \lambda d^2 + \rho \left( 2K (a + c + d)^2 + e^2 + \sigma_W^2 - \sigma_W^2 (a + d)^2 - c^2 e^2 \right) = 0, \quad (163)$$

$$d^2 K + e^2 - P = 0, \quad (164)$$

$$K^2 (a + c + d)^2 + K \left( e^2 + \sigma_W^2 - \sigma_W^2 (a + d)^2 - c^2 e^2 \right) - e^2 \sigma_W^2 = 0. \quad (165)$$

We next solve for the optimal values of $d$, $e$ and $K$ from the
above equations.
We first prove that $e = 0$ is necessary for optimality. If $e \neq
0$, then equation (162) is substituted by

$$\lambda + \rho \left( K (1 - c^2) - \sigma_W^2 \right) = 0. \quad (166)$$

We here sketch the proof that (166) cannot hold. If (166) holds,
the system of equations (161), (166), (163), (164) and (165)
can be solved analytically, and the solution takes one of the
following two possible forms. We show that neither of the two
forms is acceptable.

1) One possible form of the solution induced by (166) is



K 



-c 2 P 



2J1 



CO 



w 



(a + c) 2 c 2 



c 2 Pa 2 w 



-C 2 P + C 2 <7 2 



W 



(167) 
(168) 



By (167) and (168), the values $K$ and $e^2$ cannot both
be positive. However, since $K$ is a variance, we must
have $K > 0$, and $e^2$ also must be positive since it is
a square of a nonzero real number, so (167) and (168)
cannot be the solution.
2) The other possible form of the solution induced by (166)
is

ac)(a + cfa^y 



' w 



K 



a 

(a 2 c 3 - 2a 2 c-c-2a) + P (c 5 - 
P(c 5 - 



a 2 v {a 2 C i — 2a 2 c-c 



-2a) 
-2a)- 



-2c 3 - 
2c 3 - 



1 c 



,(169) 



.(170) 



(a + c) z (c 2 

Now, if we substitute (169) and (170) into the objective
function (157), we get a strictly negative value

$$(a + c + d)^2 K - d^2 K = \frac{P (c^2 - 1)^2 + \sigma_W^2 (ac - 1)^2}{c^2 - 1} < 0. \quad (171)$$

Since we will compute a positive objective
function (157) when the equality condition (166) is replaced
by $e = 0$, this negative value (171) cannot be the
maximum of (157).

Therefore, we conclude that (166) cannot hold when the
variables are optimal and that the proper necessary condition
extracted from (162) is $e = 0$.

For $e = 0$, equations (161), (163), (164) and (165) can be
solved and the solution takes the following form

$$K = \frac{P}{d^2}, \quad (172)$$

where $d$ satisfies the following 4-th order polynomial equation

$$R(d) \triangleq \sigma_W^2 d^4 + 2 a \sigma_W^2 d^3 + \left( a^2 \sigma_W^2 - \sigma_W^2 - P \right) d^2 - 2 P (a + c) d - P (a + c)^2 = 0. \quad (173)$$

Here, we note that the first and last coefficients of $R(d)$
satisfy $\sigma_W^2 > 0$ and $-P (a + c)^2 < 0$. Thus, the polynomial $R(d)$
has either 3 negative and 1 positive real roots or 1 negative
and 1 (or 3) positive real roots. By these arguments, we can
always select $d$ such that $d(a + c) > 0$ and get a positive
objective function to exceed the value in (171), that is

$$(a + c + d)^2 K - d^2 K > 0. \quad (174)$$

Without loss of generality, we let $d$ be

$$d \triangleq \frac{a + c}{\eta}. \quad (175)$$

We substitute (175) into (173), and get the 4-th order polynomial
equation for $\eta$ as in (153). From the previous discussion,
we note that equation (153) always has both positive and
negative real roots. We note that $K = P/d^2 = P\eta^2/(a + c)^2$,
so the information rate equals

$$\frac{1}{2} \log\left( \frac{\sigma_W^2 + (a + c + d)^2 K}{\sigma_W^2} \right) = \frac{1}{2} \log\left( 1 + \frac{(1 + \eta)^2 P}{\sigma_W^2} \right), \quad (176)$$

which implies that the optimal value of $\eta$ should be positive
to maximize (176). ■

Fig. 10. Maximal information rates for stationary feedback-dependent
sources in first-order Gaussian noise channels

In Figure 10, we plot the maximal information rate
curves $\mathcal{I}_{\max}$ achieved by (asymptotically) stationary sources
for two different first-order LTI Gaussian noise channels (one
with first-order AR noise, $a = 0$, and the other with first-order
ARMA noise). For each channel, we compare the feedback
information rate computed by Theorem 7, which is of course
the same as computed by Theorem 6, to the feed-forward
capacity computed by the water-filling method [5], [6], [34].
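The agreement between the two theorems for $L = 1$ can be checked numerically. The sketch below is one hedged way to do so, not the authors' procedure: it brute-forces the constrained problem (154)-(156) on a hypothetical grid of $(d, e^2)$ values, taking the non-negative root of the stationary equation (159) for $K$ and keeping points that satisfy the power constraint as an inequality, and compares the result with the closed form (152)-(153). Function names, grids and tolerances are assumptions.

```python
import math

def stationary_K(a, c, sigma2, d, e2):
    """Non-negative root in K of the quadratic (159); inf marks 'no steady state'."""
    B = e2 + sigma2 - sigma2 * (a + d) ** 2 - c * c * e2
    A2 = (a + c + d) ** 2
    s = math.sqrt(B * B + 4.0 * A2 * e2 * sigma2)
    if B >= 0.0:
        if B + s > 0.0:
            return 2.0 * e2 * sigma2 / (B + s)   # cancellation-free form of the root
        return 0.0 if e2 == 0.0 else math.inf
    return (s - B) / (2.0 * A2) if A2 > 0.0 else math.inf

def grid_search_rate(a, c, sigma2, P, d_grid, e2_grid):
    """Best rate (154) over grid points whose stationary power does not exceed P."""
    best = 0.0
    for d in d_grid:
        for e2 in e2_grid:
            K = stationary_K(a, c, sigma2, d, e2)
            if math.isfinite(K) and d * d * K + e2 <= P:
                rate = 0.5 * math.log((sigma2 + (a + c + d) ** 2 * K + e2) / sigma2)
                best = max(best, rate)
    return best

def closed_form_rate(a, c, sigma2, P, scan_points=20000):
    """Rate (152) using the largest positive root of the quartic (153)."""
    snr = P / sigma2

    def quartic(x):
        # Same polynomial as (153), written in factored-by-hand form.
        return snr * x * x * (1.0 + x) ** 2 + x * x - (a * x + a + c) ** 2

    hi = abs(a + c) / (1.0 - abs(a)) + 1.0   # quartic is positive from here upward
    step = hi / scan_points
    lo = 0.0
    for k in range(scan_points - 1, -1, -1):
        lo = k * step
        if quartic(lo) <= 0.0:
            break
        hi = lo
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if quartic(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * math.log(1.0 + (1.0 + hi) ** 2 * snr)
```

On a fine enough grid the brute-force value should approach the closed-form value from below, with the gap controlled by the grid spacing. With $e = 0$ the rate simplifies to $\log|a + d|$, which explains why the best grid points cluster around a single value of $d$.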

An interesting by-product of Theorem 7 is that, with
stationary feedback-dependent sources, the maximal information
rate for the first-order autoregressive (AR) noise channel
(i.e., for $a = 0$ and $c \neq 0$), e.g., the channel $a = 0$ and $c =
0.95$ in Figure 10, equals the well-known Butman feedback
capacity lower bound [11]. We next establish a formal proof
of this statement.

Corollary 7.1: For the first order autoregressive (AR) Gaussian
noise channel, where $a = 0$ and $c \neq 0$, the maximal
information rate achieved by stationary sources equals

$$\mathcal{I}_{\max} = \frac{1}{2} \log\left(\chi^2\right) = \log\left(|\chi|\right), \quad (177)$$

where the value of $\chi$ satisfies

$$\chi^2 = 1 + \frac{P}{\sigma_W^2} \left( \frac{\chi - c}{\chi} \right)^2. \quad (178)$$

This information rate (177) is exactly the same as the
achievable rate by Butman's feedback code [11], also given in
equation □ 
Proof: By Theorem 7, the optimal stationary source has
parameters $e = 0$ and $d = c/\eta$, where $\eta > 0$ in the steady
state. The value of $K = P/d^2 = \eta^2 P/c^2$ needs to satisfy the
stationary Kalman-Bucy filtering equation (135)

$$\frac{\eta^2 P}{c^2} = F^{K}(K, d, e) = \frac{P \sigma_W^2}{\sigma_W^2 + (1 + \eta)^2 P}. \quad (179)$$

From (179), we have

$$\frac{c^2}{\eta^2} = 1 + \frac{(1 + \eta)^2 P}{\sigma_W^2}. \quad (180)$$

Next, we define

$$\chi \triangleq -\frac{c}{\eta}. \quad (181)$$

By substituting the definition (181) into equation (180), we get
the equation (178) for $\chi$. Further, by substituting (180) into
the information rate formula (152), we get

$$\mathcal{I}_{\max} = \frac{1}{2} \log\left( 1 + \frac{(1 + \eta)^2 P}{\sigma_W^2} \right) = \frac{1}{2} \log\left(\chi^2\right). \quad (182)$$

■

Note that Butman's information rate [11] (Corollary 7.1)
applies only to autoregressive (AR) noise channels of the first
order $L = 1$. The optimality of Butman's code has been a
long-standing conjecture (Butman's conjecture was generalized to
higher order channels in [18]). Our solution (Theorem 6 and
Theorem 7) applies to both autoregressive (AR) and
moving-average (MA) (or combined ARMA) noise processes of any
finite order $L$. We now have a proof that Butman's code
achieves the maximal information rate $\mathcal{I}_{\max}$ of stationary
sources for first order AR channels. However, we still cannot
claim that this maximal rate $\mathcal{I}_{\max}$ equals the channel capacity
$C^{\mathrm{fb}}$. Further, though numerically verifiable, we still cannot
prove that Butman's code for higher order ($L > 1$) AR channel
noise achieves $\mathcal{I}_{\max}$.

C. Sufficient Condition for the Existence of the Feedback 
Capacity 

We consider the sufficient condition for the limit $C^{\mathrm{fb}} =
\lim_{n \to \infty} C^{\mathrm{fb}(n)}$ to exist.

Definition 4: [Time-invariant feedback-dependent Gauss-Markov
sources] A time-invariant feedback-dependent Gauss-Markov
source, restricted to the optimal structure shown in
Theorem 5, is a source whose coefficients $d_t$ and $e_t$ have
a time-invariant dependence on the covariance matrix $K_{t-1}$.
The time-dependence is captured by the dependence on the
posterior channel state statistics only, i.e.,

$$d_t = d(K_{t-1}), \quad (183)$$

$$e_t = e(K_{t-1}), \quad (184)$$

$$g_t = -d_t^T m_{t-1} = -d^T(K_{t-1})\, m_{t-1}. \quad (185)$$

□

It is interesting to note that a time-invariant source as
defined above may in fact induce a channel input process $X_t$
whose parameters $d_t$, $e_t$ and the induced process $K_t$ never
reach a steady state as $t \to \infty$. So, time-invariance does not
guarantee stationarity.$^{3}$
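A time-invariant rule that never settles can be constructed explicitly. The sketch below is a constructed illustration in the spirit of the footnote example, not a reproduction of it: it assumes a first-order channel with $a = 0$, $c = 0.2$, $e_t = 0$ and $\sigma_W^2 = 1$, uses the two variance levels 0.0102 and 0.0062, and computes (rather than quotes) the two coefficient values that make the variance jump between them. The switching threshold and all function names are hypothetical.

```python
import math

def next_variance(K, d, c, sigma2=1.0):
    """Scalar Riccati update (156) specialized to a = 0 and e = 0."""
    return d * d * K * sigma2 / ((c + d) ** 2 * K + sigma2)

def coefficient_for_jump(K_from, K_to, c, sigma2=1.0):
    """Negative d that maps K_from to K_to in one step (root of a quadratic in d)."""
    A = K_from * (sigma2 - K_to)
    B = -2.0 * c * K_from * K_to
    C = -K_to * (c * c * K_from + sigma2)
    return (-B - math.sqrt(B * B - 4.0 * A * C)) / (2.0 * A)

def simulate(K0, c, threshold, d_high, d_low, steps):
    """Apply the time-invariant rule d(K) = d_high if K > threshold else d_low."""
    trajectory = [K0]
    for _ in range(steps):
        K = trajectory[-1]
        d = d_high if K > threshold else d_low
        trajectory.append(next_variance(K, d, c))
    return trajectory

c = 0.2
K_hi, K_lo = 0.0102, 0.0062
d_high = coefficient_for_jump(K_hi, K_lo, c)   # used when the variance is large
d_low = coefficient_for_jump(K_lo, K_hi, c)    # used when the variance is small
trajectory = simulate(K_hi, c, threshold=0.0082, d_high=d_high, d_low=d_low, steps=10)
```

The rule depends on time only through the current variance, yet the resulting sequence alternates between the two levels indefinitely, so $K_t$ has no limit.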

Corollary 4.1 states a finite-horizon stochastic control
problem. We next consider the corresponding infinite-horizon
problem. In the next lemma, we link the Bellman equation [23] for
this infinite-horizon problem to the feedback capacity (should
it exist). Subsequently, we formulate a sufficient condition for
the feedback capacity $C^{\mathrm{fb}}$ to exist.

Lemma 4: Let the function $C(\gamma)$ with argument $\gamma > 0$ be
defined as

$$C(\gamma) \triangleq \lim_{n \to \infty} \max_{\mathcal{P}^{GM}} \frac{1}{n} \sum_{t=1}^{n} \varpi\left(m_{t-1}, K_{t-1}, d_t, e_t, g_t, \gamma\right), \quad (186)$$

if the limit exists. Bellman's equation [23] associated
with (186) takes the following form

$$C(\gamma) + \pi(K, \gamma) = \max_{d, e} \left\{ \varpi\left(m, K, d, e, -d^T m, \gamma\right) + \pi(\tilde{K}, \gamma) \right\}. \quad (187)$$

Here, the function $\pi(K, \gamma)$ is the optimal relative
reward-to-go function, and the symbol $\tilde{K}$ on the right-hand side of (187) is
the short notation for the Kalman-Bucy filter output, that is

$$\tilde{K} = F^{K}(K, d, e). \quad (188)$$

Bellman's equation (187) is solved by a time-invariant source
as long as it has a solution. Further, the time-invariant source
that solves (187) also solves (186), in which case the
asymptotic feedback capacity exists and equals

$$C^{\mathrm{fb}} = \lim_{n \to \infty} C^{\mathrm{fb}(n)} = C(\gamma) + \gamma P. \quad (189)$$

□

Proof: The lemma is proved by applying the results for
the average-cost-per-stage stochastic control problem in [23]
(Ch. 7 and Volume II). Note that if there exists a value $C(\gamma)$
and a function $\pi(K, \gamma)$ which solve Bellman's equation (187),
the value $C(\gamma)$ and the corresponding feedback-dependent
Gauss-Markov source $\mathcal{P}^{GM}$ determined by coefficients $d_t =
d(K_{t-1})$ and $e_t = e(K_{t-1})$ also solve the maximization
in (186), and vice versa, see [23] (Ch. 7 and Volume II).
The time-invariant source $\mathcal{P}^{GM}$ that solves Bellman's
equation (187) is thus optimal for the particular choice of power
shadow price $\gamma$.

3 Here, we give an example of a time-invariant but non-stationary source. 
Suppose that we have a first-order system (such as the one discussed in Section 
VI-B) for which a = and c = 0.2. Let the filter coefficients et = and 
dt be time-invariant, but let dt be dependent on the value of the posterior 
state variance Kt-i as: d t = d(Kt-i) = —1 if Kt—l > O.Of, and dt = 
d(Kt-i) = — i9 if K t — 1 < O.Of. Note that dt is indeed a time-invariant 
function of Kt—i because it does not depend on time t, but rather on the 
value of Kt—l. Let the value of the posterior state variance at time t — 1 be 
Kt—i = 0.0102. Then, because of this special choice of the parameters, when 
substituted into the Riccati equation H56\ we get Kt = 0.0062. Applying 
the Riccati equation one more time, we get Kt+i = 0.0f02 = Kt— 1. So, 
the system oscillates.

From Theorem 4 and Theorem 5, the feedback capacity
exists if $C(\gamma)$ defined in (186) exists, and equation (189)
follows. ■

In general, ensuring that Bellman's equation (187) has a
solution can be very complicated, see [23] (Ch. 7 and Volume
II). However, from Lemma 4, we know that if the solution
exists, it must be a time-invariant Gauss-Markov source. One
sufficient condition [23] to guarantee the existence of a
time-invariant solution to Bellman's equation (187) is that for two
arbitrary valid covariance matrices $K'$ and $K''$, there exists a
time-invariant Gauss-Markov source distribution that drives
the Kalman-Bucy filter covariance matrix from value $K_t = K'$
to value $K_{t+\tau} = K''$ within finite time $\tau < \infty$. We note that
verifying such a sufficient condition is possible only on a case
by case basis, and a systematic analytic verification is still
missing.

Further, even if the optimal feedback-dependent source
is time-invariant, the covariance matrix sequence $K_t$ may
not converge. One way to numerically check whether the
sequence $K_t$ converges for a given channel is to run the
dynamic-programming Algorithm 2 (value iteration) or policy
iteration [23] for a large block length $n$. However, besides
this numerical verification procedure, there are no known
systematic approaches to analytically handle such a problem
for an arbitrary channel. Therefore, to make further progress
towards finding an analytic solution to the feedback capacity
problem, we would need to prove that $K_t$ converges as $t \to \infty$.

By numerically running the dynamic programming
Algorithm 2 for various Gaussian noise channels with large block
lengths $n$, e.g., $n > 100$, we have always observed that the
Kalman-Bucy filter in Figure 7 becomes stationary, i.e., $K_t$
converges numerically, as $t$ becomes large. It has been a
long-standing conjecture that stationary sources achieve the
feedback capacity [11]. Here, we reformulate the conjecture
in terms of the posterior state covariance matrix computed by
the Kalman-Bucy filter.

Conjecture 1: The optimal (feedback-capacity-achieving)
source induces a stationary (or asymptotically stationary)
Kalman-Bucy filter for processing the feedback, i.e., for the
optimal source, the limit

$$\lim_{t \to \infty} K_t = K$$

exists. □
Under Conjecture 1, the feedback capacity of a power
constrained linear Gaussian noise channel would be achieved by an
(asymptotically) stationary source, and the feedback capacity
would equal $\mathcal{I}_{\max}$ given in Theorem 6.

For first-order moving-average (MA) linear Gaussian noise
channels, Kim [35] recently proved that a uniform power
allocation over time is asymptotically optimal by following
a different approach from the one used in this paper, and that
the feedback capacity equals the maximal information rate
derived in Theorem 7 for the moving-average noise subcase.
This new result implies that stationary sources for first-order
moving-average (MA) linear Gaussian noise channels are
indeed optimal.



VII. Conclusion 

We considered the problem of computing the n-block feed- 
back capacity of a Gaussian noise channel with memory under 
an average channel input power constraint. In its full gener- 
ality, the problem would consider any power spectral density 
of the Gaussian noise process. However, for technical reasons, 
we only considered noise processes that have rational power 
spectra, i.e., noise processes that are either autoregressive (AR) 
or moving average (MA) or both (ARMA). Since we were 
computing the capacity of a channel with memory, we found 
it beneficial to cast the problem in the state-space realization 
formulation, which proved to be well-suited for this problem. 

For the Gaussian noise channel with a rational power 
spectrum, we found that the n-block feedback capacity $C^{\mathrm{fb}(n)}$
is achieved by a Gauss-Markov (not necessarily stationary) 
source distribution, where the channel input depends only 
on the previous channel state and the posterior channel state 
distribution computed by a Kalman-Bucy filter. Further, we 
showed that the channel state, the posterior channel state 
distribution and the channel output jointly form a Markov 
process. 

The Markov property of the optimal source reduced the n- 
block feedback capacity computation to a standard dynamic- 
system stochastic control problem, which can be solved by 
dynamic programming. For this optimization problem, we 
found a simple structure of the optimal source, where the 
encoding complexity is constant for any time instant. We 
showed that the coefficients of the optimal Gauss-Markov 
source depend only on the covariance matrix of the posterior 
channel state estimate computed by a Kalman-Bucy filter, 
and can be optimized deterministically and off-line. The n- 
block optimization problem is thus broken into n sequential 
problems. In each sequential step, $O(L^2)$ variables need to be
solved for, where L is the order of the ARMA channel noise. 

We note that for additive white Gaussian noise (AWGN)
channels, retransmitting the message uncertainty [15] or
transmitting the newly coded signal [1] could both achieve the
channel capacity (the feedback capacity equals the
feed-forward capacity). In our formulation (125), it is still an open
problem to determine if both parameters $d_t$ and $e_t$ could
take non-zero optimal values. For the initial transmission at
time $t = 1$, since $s_0$ is known, the transmitter has to let $e_t \neq 0$
to start transmission. Our numerical simulations have always
suggested that, for $t \to \infty$, the optimal value of $e_t$ should be
zero, but a proof is missing.

We solved analytically the maximal feedback information
rate $\mathcal{I}_{\max}$ achieved by (asymptotically) stationary sources,
which represents a lower bound on the feedback capacity
$C^{\mathrm{fb}}$. Under a Kalman-Bucy filter stationarity assumption
(Conjecture 1), the feedback capacity $C^{\mathrm{fb}}$ would equal $\mathcal{I}_{\max}$.
Conjecture 1 is a reformulation (in terms of Kalman-Bucy
filtering parameters) of a long-standing conjecture that stationary
sources achieve the feedback capacity.

Appendix I
Alternative Proof of Theorem 2

Here we present an alternative proof of Theorem 2 based
on dynamic programming [23] and Lagrange multipliers (see
Appendix II). Consider the mixed cost:



1 ^o)~7-E 



Ew 2 



(190) 



As discussed in Appendix II, there is, under an optimal source,
a one-to-one relationship between the Lagrange multiplier
(shadow price) $\gamma$ and the resulting average power constraint $P$,
and hence the optimization of the mixed cost in (190) yields
the optimal source. We will now show that for any shadow
price $\gamma$, including the $\gamma$ corresponding to the given power $P$,
we can without loss of generality restrict ourselves to sources
of the form $\{P_t(s_t \mid s_{t-1}, \alpha_{t-1}(\cdot)),\; t = 1, 2, \ldots\}$.

As is typical of dynamic programming arguments, we will
prove this by backward induction starting at time $n$. At time
$n$ the optimization is:



max 



This 



be 



rewritten as 

-1 \ 



^0' X l 



( fl9lT l, where A„_ x (-) = 
is the random function whose 



realizations are a n _i(-) = P g i g y „-i 

Because $s_0$ is fixed, and to simplify notation, we will not
explicitly condition on $s_0$ in the following discussion. To
compute the first term in the inner expectation of (191), we
need access to

$$p\left(y_n, s_{n-1}^n, \alpha_{n-1}(\cdot) \mid y_1^{n-1}\right) = P\left(y_n \mid s_{n-1}^n\right) P_n\left(s_n \mid s_{n-1}, y_1^{n-1}\right) \alpha_{n-1}(s_{n-1}).$$

To compute the second term in the inner expectation of (191),
we need access to

$$p\left(x_n \mid y_1^{n-1}\right) = \int P\left(x_n \mid s_{n-1}^n\right) P_n\left(s_n \mid s_{n-1}, y_1^{n-1}\right) \alpha_{n-1}(s_{n-1}) \, ds_{n-1}^n.$$

Hence, in the inner expectation of (191), the maximization
over the choice of $P_n(s_n \mid s_{n-1}, y_1^{n-1})$ requires knowing
only $s_{n-1}$ and $\alpha_{n-1}(\cdot)$. Thus without loss of generality
we can restrict the source at time $n$ to be of the form:
$P_n(s_n \mid s_{n-1}, \alpha_{n-1}(\cdot))$. Let the optimal cost-to-go [23] at
time $n$, given by the inner expectation in (191), be denoted by
$J_n(\alpha_{n-1}(\cdot))$.

Now, via the induction hypothesis assume that the source 
for times r = t + 1, „., n can be chosen without loss of 
generality to be of the form {P T (s T | s T _ l5 a T _i(-)), r = 
t+1, ...,n}, and assume the cost-to-go functions J T (a T _i(-)) 
can be chosen to only depend on a T _x(-) for r = t+1, ...,n. 
The optimization at time t is given in d 1 92b where J t+ i(-) 
is the optimal cost-to-go (which by the induction hypothesis 
only depends on at). 

As in (191), we can write (192) as an iterated expectation
conditioned on $Y_1^{t-1}$. The inner expectation for the new term
$J_{t+1}(\alpha_t(\cdot))$ takes the form

$$E[J_{t+1}(A_t(\cdot)) \mid y_1^{t-1}]. \qquad (193)$$

To compute (193), we need access to $p(\alpha_t(\cdot) \mid y_1^{t-1})$. Now by
equation (59) we know that $\alpha_t(\cdot)$ is a function of $\alpha_{t-1}(\cdot)$, $y_t$,
and $P_t(s_t \mid s_{t-1}, y_1^{t-1})$. Therefore, to compute (193), we need
access to

$$p(y_t, \alpha_{t-1}(\cdot) \mid y_1^{t-1}) = \int P(y_t \mid s_{t-1}^t)\, P_t(s_t \mid s_{t-1}, y_1^{t-1})\, \alpha_{t-1}(s_{t-1})\, ds_{t-1}^t.$$

Hence (193) depends on $y_1^{t-1}$ only through $\alpha_{t-1}$ and the
choice of source $P_t(s_t \mid s_{t-1}, y_1^{t-1})$.

This observation, along with an argument similar to the one
given for the optimization (191), shows that the optimization
in (192) can, without loss of generality, be restricted to a
source of the form $P_t(s_t \mid s_{t-1}, \alpha_{t-1}(\cdot))$. Thus we have
proved Theorem 2.
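
The backward-induction structure used in this proof can be sketched on a finite toy problem. The sketch below is only an illustration under assumptions that are not in the paper: a finite set of summary states stands in for the posterior $\alpha_{t-1}(\cdot)$, and the names `backward_induction`, `reward`, and `transition` are hypothetical.

```python
def backward_induction(n, states, actions, reward, transition):
    """Finite-horizon dynamic programming by backward induction.

    reward(a, u)     -> immediate reward in summary state a under action u
    transition(a, u) -> list of (probability, next_state) pairs
    Returns J[t][a] (optimal cost-to-go) and policy[t][a] for t = 1..n.
    """
    J = {n + 1: {a: 0.0 for a in states}}   # nothing left to collect after time n
    policy = {}
    for t in range(n, 0, -1):               # t = n, n-1, ..., 1
        J[t], policy[t] = {}, {}
        for a in states:
            best_u, best_val = None, float("-inf")
            for u in actions:
                # immediate reward plus expected cost-to-go of the next summary state
                val = reward(a, u) + sum(p * J[t + 1][b] for p, b in transition(a, u))
                if val > best_val:
                    best_u, best_val = u, val
            J[t][a], policy[t][a] = best_val, best_u
    return J, policy


# Toy instance (hypothetical): the action costs 0.25 and pays off only in state 1.
def reward(a, u):
    return a * u - 0.25 * u


def transition(a, u):
    return [(0.5, 0), (0.5, 1)]


J, policy = backward_induction(2, [0, 1], [0, 1], reward, transition)
```

As in the proof, the policy at each time depends on the past only through the current summary state, because both the immediate reward and the cost-to-go are functions of that state alone.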

Appendix II 

Strong Duality and the Relationship between the 
Power Constraint and the Shadow Price 

A. Strong Duality under Slater's Conditions 

Let $I(\mathcal{P})$ be a function (not necessarily concave) of a
variable $\mathcal{P}$. Let $f(\mathcal{P}) \le 0$ be a constraint on the variable
$\mathcal{P}$. Let

$$\mathcal{P}^* = \arg\max_{\mathcal{P}:\, f(\mathcal{P}) \le 0} I(\mathcal{P})$$

be the solution of the constrained optimization problem, and
let

$$I^* = I(\mathcal{P}^*) = \max_{\mathcal{P}:\, f(\mathcal{P}) \le 0} I(\mathcal{P})$$

be the constrained maximum. Define the Lagrangian for the
constrained optimization problem as

$$\mathcal{L}(\mathcal{P}, \gamma) = I(\mathcal{P}) - \gamma f(\mathcal{P}),$$

for which the Lagrangian dual function is the unconstrained
maximum

$$G(\gamma) = \max_{\mathcal{P}} \left[ I(\mathcal{P}) - \gamma f(\mathcal{P}) \right].$$

From Boyd-Vandenberghe [30], Section 5.1.2, we have that
$G(\gamma)$ is convex (even when $I(\mathcal{P})$ is not concave), and from
Section 5.1.3, we have

$$G(\gamma) \ge I^* = I(\mathcal{P}^*).$$

Let the solution to the Lagrangian dual problem be

$$\gamma^* = \arg\min_{\gamma} G(\gamma).$$

Weak duality, Boyd-Vandenberghe [30], Section 5.2.2, guarantees

$$G^* = G(\gamma^*) \ge I(\mathcal{P}^*) = I^*.$$

Theorem (Strong duality under Slater's conditions,
Boyd-Vandenberghe [30], Section 5.2.3) If $I(\mathcal{P})$ is a
concave function, if $f(\mathcal{P})$ is a convex function, and
if there exists a parameter $\mathcal{P}$ for which $f(\mathcal{P}) < 0$ is
feasible, then strong duality $I^* = G^*$ holds.
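
As a numerical sanity check of the theorem, the following sketch works through a one-dimensional toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: the objective $I(p) = \tfrac{1}{2}\log(1+p)$ (a scalar AWGN channel with unit noise variance, in nats), the constraint $f(p) = p - P \le 0$ with $p \ge 0$, and the helper names `dual_function` and `minimize_convex`.

```python
import math


def dual_function(gamma, P):
    """G(gamma) = max_{p >= 0} [0.5*log(1 + p) - gamma*(p - P)] for the toy problem."""
    # Stationary point of the Lagrangian in p, clipped to the domain p >= 0.
    p_star = max(0.0, 1.0 / (2.0 * gamma) - 1.0)
    return 0.5 * math.log(1.0 + p_star) - gamma * (p_star - P)


def minimize_convex(g, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            hi = m2          # the minimizer cannot lie to the right of m2
        else:
            lo = m1          # the minimizer cannot lie to the left of m1
    return 0.5 * (lo + hi)


P = 3.0
primal = 0.5 * math.log(1.0 + P)   # I*, attained at p = P because I is increasing
gamma_star = minimize_convex(lambda g: dual_function(g, P), 1e-6, 10.0)
dual = dual_function(gamma_star, P)   # G* = G(gamma*)
```

For this toy problem the dual minimizer is $\gamma^* = 1/(2(1+P))$, and the dual minimum coincides with the primal maximum, as strong duality predicts for a concave objective with a strictly feasible point.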
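$$\max_{P_n(s_n \mid s_{n-1}, y_1^{n-1})} E\left[ E\left[ \log \frac{P(Y_n \mid S_{n-1}^n)}{P(Y_n \mid Y_1^{n-1}, s_0)} - \gamma\, (X_n)^2 \,\middle|\, Y_1^{n-1}, s_0 \right] \,\middle|\, s_0 \right] \qquad (191)$$

$$\max_{P_t(s_t \mid s_{t-1}, y_1^{t-1})} I(S_{t-1}^t; Y_t \mid Y_1^{t-1}, s_0) - \gamma \cdot E[(X_t)^2 \mid s_0] + E[J_{t+1}(A_t(\cdot)) \mid s_0] \qquad (192)$$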

B. Propositions Involving the Feedback Capacity 

Let $C^{\mathrm{fb}(n)}(P)$ be the $n$-block feedback capacity under the
power constraint $P$, and let $C^{(n)}(P)$ be the feed-forward
$n$-block capacity under the same power constraint.

Proposition 1 (Cover-Pombra [7])

$$C^{\mathrm{fb}(n)}(P) \le C^{(n)}(P) + \frac{1}{2}.$$

Proposition 2 (reformulation: water-filling theorem,
Gallager [34], Theorem 7.5.1)

$$C^{(n)}(P) = \frac{1}{2n} \sum_{i=1}^{k} \log \frac{nP + r_1 + r_2 + \cdots + r_k}{k\, r_i},$$

where $0 < r_1 \le r_2 \le \cdots \le r_n$ are the eigenvalues of the
$n \times n$ channel noise covariance matrix and $k$ ($\le n$) is
the largest integer satisfying $nP + r_1 + r_2 + \cdots + r_k > k\, r_k$.
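
The formula in Proposition 2 can be evaluated directly from the eigenvalues. The sketch below is a minimal illustration, assuming natural logarithms (capacity in nats per channel use); the function name `waterfilling_capacity` is hypothetical and does not come from the paper.

```python
import math


def waterfilling_capacity(eigs, P):
    """Feed-forward n-block capacity C^(n)(P) in nats per channel use.

    eigs: positive eigenvalues of the n x n noise covariance matrix.
    P:    average power per channel use (the total power budget is n * P).
    """
    r = sorted(eigs)                      # r_1 <= r_2 <= ... <= r_n
    n = len(r)
    k, prefix, active_sum = 0, 0.0, 0.0
    for j in range(1, n + 1):
        prefix += r[j - 1]                # r_1 + ... + r_j
        if n * P + prefix > j * r[j - 1]:
            k, active_sum = j, prefix     # largest j satisfying the condition so far
    if k == 0:
        return 0.0                        # no power to distribute
    level = (n * P + active_sum) / k      # common water level of the k active modes
    return sum(math.log(level / r[i]) for i in range(k)) / (2 * n)
```

For white noise with unit variance the expression reduces to the familiar $\tfrac{1}{2}\log(1 + P)$, which gives a quick consistency check of the reconstruction.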

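Proposition 3

$$\lim_{P \to \infty} \frac{C^{\mathrm{fb}(n)}(P)}{P} = 0.$$

proof: Using Propositions 1 and 2, we have

$$\lim_{P \to \infty} \frac{C^{\mathrm{fb}(n)}(P)}{P} \le \lim_{P \to \infty} \frac{C^{(n)}(P) + \frac{1}{2}}{P} = \lim_{P \to \infty} \frac{\frac{1}{2n} \sum_{i=1}^{k} \log \frac{nP + r_1 + r_2 + \cdots + r_k}{k\, r_i} + \frac{1}{2}}{P}.$$

Applying L'Hôpital's rule to the right-hand side
proves the proposition.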
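
The L'Hôpital step can be written out explicitly. The following is a sketch that assumes $P$ is large enough for all $n$ eigenvalues to be active ($k = n$, which holds once $nP + r_1 + \cdots + r_n > n\, r_n$); the shorthand $R$ is introduced here and does not appear in the original text.

```latex
% Assumption: P large enough that k = n.
% Shorthand introduced for this sketch: R := r_1 + r_2 + \cdots + r_n.
\begin{align*}
\frac{d}{dP}\left[\frac{1}{2n}\sum_{i=1}^{n}\log\frac{nP+R}{n\,r_i}+\frac{1}{2}\right]
  &= \frac{1}{2n}\sum_{i=1}^{n}\frac{n}{nP+R}
   = \frac{n}{2\,(nP+R)}, \\
\lim_{P\to\infty}\frac{C^{(n)}(P)+\frac{1}{2}}{P}
  &= \lim_{P\to\infty}\frac{n}{2\,(nP+R)} = 0 .
\end{align*}
```

Since $C^{\mathrm{fb}(n)}(P) \ge 0$, the upper bound of zero forces the limit itself to be zero.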
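Proposition 4 For any $\gamma > 0$,

$$\lim_{P \to \infty} \left[ C^{\mathrm{fb}(n)}(P) - \gamma P \right] = -\infty.$$

proof: Similar to the proof of Proposition 3. Equivalently, write
$C^{\mathrm{fb}(n)}(P) - \gamma P = P \left( C^{\mathrm{fb}(n)}(P)/P - \gamma \right)$; by Proposition 3 the
factor in parentheses tends to $-\gamma < 0$ as $P \to \infty$.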

C. Strong Duality of the n-Block Feedback Capacity 

Define $\mathcal{P}$ as the source (under some proper parametrization;
here it is convenient to choose the parametrization in
Vandenberghe, Boyd and Wu [19] because it leads to a concave
feedback information rate). Let $I^{\mathrm{fb}(n)}(\mathcal{P})$ be the $n$-block
feedback information rate achieved by the source $\mathcal{P}$. Let the
source $\mathcal{P}$ be subject to a power constraint $P_{\mathrm{ow}}(\mathcal{P}) \le P$. The
primal constrained optimization problem is then stated as

$$C^{\mathrm{fb}(n)}(P) = \max_{\mathcal{P}:\, P_{\mathrm{ow}}(\mathcal{P}) \le P} I^{\mathrm{fb}(n)}(\mathcal{P}).$$

The case $P = 0$ is trivial and can be dismissed. The function
$I^{\mathrm{fb}(n)}$ is concave and the power constraint is convex, see
Vandenberghe, Boyd and Wu [19]. Further, for any $P > 0$,
there exists a feasible source that satisfies $P_{\mathrm{ow}}(\mathcal{P}) < P$ (just
take the trivial zero-source as an example). Hence, Slater's
conditions for strong duality are satisfied and the solution of
the primal problem equals the solution of the dual problem
for any $P > 0$.

The dual problem is formulated as follows. The Lagrangian
is

$$\mathcal{L}(\mathcal{P}, \gamma) = I^{\mathrm{fb}(n)}(\mathcal{P}) - \gamma P_{\mathrm{ow}}(\mathcal{P}) + \gamma P.$$

The Lagrangian dual function is the unconstrained maximum

$$G(\gamma) = \max_{\mathcal{P}} \mathcal{L}(\mathcal{P}, \gamma).$$

The solution to the dual problem is

$$\gamma^* = \arg\min_{\gamma} G(\gamma),$$

and the $n$-block feedback capacity is

$$C^{\mathrm{fb}(n)}(P) = G^* \triangleq G(\gamma^*) = \min_{\gamma} G(\gamma).$$



D. Relationship between the Power Constraint and the 
Shadow Price 

Because of the strong duality between the $n$-block feedback
capacity computation problem and its dual, for any power
constraint $P > 0$, the solution to the dual problem gives a
parameter $\gamma^*$ such that $G(\gamma^*)$ is the solution to the primal
problem. We now want to establish the backwards relationship,
i.e., that for any chosen shadow price $\gamma$, the unconstrained
solution to the Lagrangian maximization

$$\mathcal{P}^* = \arg\max_{\mathcal{P}} \left[ I^{\mathrm{fb}(n)}(\mathcal{P}) - \gamma P_{\mathrm{ow}}(\mathcal{P}) \right] \qquad (194)$$

gives a source $\mathcal{P}^*$ whose power satisfies $P_{\mathrm{ow}}(\mathcal{P}^*) = P < \infty$,
such that $\mathcal{P}^*$ is the solution to the primal problem when $P$ is
the power constraint.

First, we can easily dismiss the shadow price $\gamma = 0$ from
consideration because if $\gamma = 0$, then the solution to (194)
clearly gives a source whose power is $P = \infty$, i.e., the source
is not power-constrained and can be dismissed.

Next, we want to establish that for any power constraint
$P < \infty$, there exists an optimal shadow price $\gamma^* > 0$, and
vice versa, that for any shadow price $\gamma > 0$, the solution
to the unconstrained Lagrangian optimization (194) gives a
source whose power $P$ is finite, $P < \infty$.

Proposition A For any power constraint $P < \infty$,
the solution to the Lagrangian dual problem satisfies
$\gamma^* > 0$.

proof: Pick any $P < \infty$. Then clearly, $C^{\mathrm{fb}(n)}(P) < \infty$.
Because of the strong duality, we must have

$$G(\gamma^*) = C^{\mathrm{fb}(n)}(P) < \infty,$$

where

$$\gamma^* = \arg\min_{\gamma} G(\gamma).$$

Now, assume that $\gamma^* = 0$. Then the solution to the
dual problem delivers $G(\gamma^*) = G(0) = \infty$, which
contradicts the previously established relationship
$C^{\mathrm{fb}(n)}(P) = G(\gamma^*) < \infty$. Hence, $\gamma^* > 0$ must hold.

Proposition B For any shadow price $\gamma > 0$, the
solution to the unconstrained Lagrangian optimization
problem (194) delivers a source whose power is
finite, i.e., $P < \infty$.

proof: First notice that

$$\max_{\mathcal{P}} \left[ I^{\mathrm{fb}(n)}(\mathcal{P}) - \gamma P_{\mathrm{ow}}(\mathcal{P}) \right] \ge 0,$$

because we can always pick the trivial all-zero
source whose power and information rate are zero.
Now, assume that there exists a $\gamma > 0$ such that the
solution to (194) delivers a source $\mathcal{P}^*$ whose power
is $P_{\mathrm{ow}}(\mathcal{P}^*) = P = \infty$. For such a source, invoking
Proposition 4, the following holds:

$$\begin{aligned}
\max_{\mathcal{P}} \left[ I^{\mathrm{fb}(n)}(\mathcal{P}) - \gamma P_{\mathrm{ow}}(\mathcal{P}) \right]
 &= I^{\mathrm{fb}(n)}(\mathcal{P}^*) - \gamma P_{\mathrm{ow}}(\mathcal{P}^*) \\
 &\le \lim_{P \to \infty} C^{\mathrm{fb}(n)}(P) - \gamma P_{\mathrm{ow}}(\mathcal{P}^*) \\
 &= \lim_{P \to \infty} \left[ C^{\mathrm{fb}(n)}(P) - \gamma P \right] \\
 &= -\infty,
\end{aligned}$$

which contradicts our earlier conclusion. Thus, for
any $\gamma > 0$, the solution to (194) must be a source
whose power is finite, $P < \infty$.

Propositions A and B jointly establish that any shadow
price $\gamma > 0$ maps to a power $P < \infty$, and vice versa. We
now show that the solution to the unconstrained Lagrangian
optimization (194) gives a source that achieves the feedback
capacity for some $P < \infty$.

Proposition C For any $\gamma > 0$, the solution to the
unconstrained Lagrangian optimization (194) gives
a source $\mathcal{P}^*$ with power $P_{\mathrm{ow}}(\mathcal{P}^*) = P$ for which
$\gamma^* = \gamma$ is the solution to the Lagrangian dual
optimization.

proof: Pick some $\gamma > 0$. For this $\gamma$, Proposition B
established that the solution to (194) is a source $\mathcal{P}^*$
whose power $P$ is finite, $P < \infty$. For such finite $P$,
strong Lagrangian duality holds, so for this value of
$P$, we must have

$$I^{\mathrm{fb}(n)}(\mathcal{P}^*) = C^{\mathrm{fb}(n)}(P) = G(\gamma^*),$$

where $\gamma^*$ is a solution to the Lagrangian dual
problem for power $P$. Now, if we substitute this
source $\mathcal{P}^*$, whose power is $P_{\mathrm{ow}}(\mathcal{P}^*) = P$, into the
expression for the Lagrangian dual function $G(\cdot)$,
for our chosen value $\gamma > 0$, we get

$$G(\gamma) = I^{\mathrm{fb}(n)}(\mathcal{P}^*) = C^{\mathrm{fb}(n)}(P) = G(\gamma^*).$$

Since $G(\cdot)$ is a convex function, and since $G(\gamma^*)$ is
the minimum of $G(\cdot)$, it follows that $G(\gamma)$ is also
the minimum of $G(\cdot)$. Therefore $\gamma$ is also a solution
to the Lagrangian dual problem for power $P$. Hence
we can set $\gamma^* = \gamma$.

Proposition C established that any chosen $\gamma > 0$ is indeed
the solution $\gamma^* = \gamma$ of the Lagrangian dual problem for some
power $P < \infty$. The following two propositions establish that
two different power constraints $P_1 \ne P_2$ correspond to two
different shadow prices $\gamma_1^* \ne \gamma_2^*$.

Proposition D If $P_1 \ne P_2$, then $\gamma_1^* \ne \gamma_2^*$.

proof: Let $P_1 \ne P_2$. By the monotonicity and
continuity of $C^{\mathrm{fb}(n)}(P)$, we have

$$C^{\mathrm{fb}(n)}(P_1) \ne C^{\mathrm{fb}(n)}(P_2).$$

Now, assume that the respective solutions to the
Lagrangian dual problems are equal, i.e., $\gamma_1^* = \gamma_2^*$.
Then

$$C^{\mathrm{fb}(n)}(P_1) = G(\gamma_1^*) = G(\gamma_2^*) = C^{\mathrm{fb}(n)}(P_2),$$

which contradicts our earlier conclusion that
$C^{\mathrm{fb}(n)}(P_1) \ne C^{\mathrm{fb}(n)}(P_2)$. Hence $\gamma_1^* \ne \gamma_2^*$ must
hold.

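Proposition E If $\gamma_1^* < \gamma_2^*$, then $P_1 > P_2$.

proof: We observe that for any $\gamma_1^* < \gamma_2^*$ and
for any $\mathcal{P}$ (such that $P_{\mathrm{ow}}(\mathcal{P}) > 0$), we have the
following inequality:

$$\mathcal{L}(\mathcal{P}, \gamma_1^*) > \mathcal{L}(\mathcal{P}, \gamma_2^*).$$

We first show that

$$G(\gamma_1^*) = \max_{\mathcal{P}} \mathcal{L}(\mathcal{P}, \gamma_1^*) > \max_{\mathcal{P}} \mathcal{L}(\mathcal{P}, \gamma_2^*) = G(\gamma_2^*).$$

Let us assume that the contrary is true, that is,

$$\max_{\mathcal{P}} \mathcal{L}(\mathcal{P}, \gamma_1^*) \le \max_{\mathcal{P}} \mathcal{L}(\mathcal{P}, \gamma_2^*).$$

Let $\mathcal{P}_2^*$ be

$$\mathcal{P}_2^* = \arg\max_{\mathcal{P}} \mathcal{L}(\mathcal{P}, \gamma_2^*).$$

Then we have

$$\mathcal{L}(\mathcal{P}_2^*, \gamma_1^*) \le \max_{\mathcal{P}} \mathcal{L}(\mathcal{P}, \gamma_1^*) \le \max_{\mathcal{P}} \mathcal{L}(\mathcal{P}, \gamma_2^*) = \mathcal{L}(\mathcal{P}_2^*, \gamma_2^*),$$

which contradicts the established inequality
$\mathcal{L}(\mathcal{P}, \gamma_1^*) > \mathcal{L}(\mathcal{P}, \gamma_2^*)$ for any $\mathcal{P}$.
As a result, we have

$$C^{\mathrm{fb}(n)}(P_1) = G(\gamma_1^*) > G(\gamma_2^*) = C^{\mathrm{fb}(n)}(P_2).$$

By the monotonicity and continuity of $C^{\mathrm{fb}(n)}(P)$, we
further have

$$P_1 > P_2.$$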
Corollary F The shadow price $\gamma = \gamma^* > 0$
and the power constraint $P < \infty$ are in a 1-to-1
correspondence.

proof: It is a direct consequence of Propositions D
and E.
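
The correspondence can be seen concretely in the simpler feed-forward (water-filling) setting of Proposition 2. The sketch below is an illustration under assumptions that are not taken from the paper: it uses independent parallel channels with noise variances `eigs` in place of the feedback problem, natural logarithms, and hypothetical function names. The inverse map is only meaningful for shadow prices below $1/(2 r_1)$, where the resulting power is positive.

```python
def power_for_shadow_price(eigs, gamma):
    """Average power of the maximizer of (1/n) * sum_i [0.5*log(1 + p_i/r_i) - gamma*p_i].

    Setting the derivative in p_i to zero gives p_i = max(0, 1/(2*gamma) - r_i).
    """
    n = len(eigs)
    level = 1.0 / (2.0 * gamma)                  # water level implied by gamma
    return sum(max(0.0, level - r) for r in eigs) / n


def shadow_price_for_power(eigs, P, iters=200):
    """Invert the map by bisection; assumes P > 0, where the map is strictly decreasing."""
    lo, hi = 1e-12, 1.0 / (2.0 * min(eigs))      # power(hi) = 0, power(lo) is very large
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if power_for_shadow_price(eigs, mid) > P:
            lo = mid                             # too much power: raise the price
        else:
            hi = mid                             # too little power: lower the price
    return 0.5 * (lo + hi)
```

In this toy setting a larger shadow price always yields a smaller power, mirroring Proposition E, and the bisection recovers the unique price for a given positive power, mirroring Corollary F.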